Author Type

Graduate Student

Date of Award

Fall 11-5-2025

Document Type

Dissertation

Publication Status

Version of Record

Submission Date

December 2025

Department

Computer and Electrical Engineering and Computer Science

Degree Name

Doctor of Philosophy (PhD)

Thesis/Dissertation Advisor [Chair]

Taghi M. Khoshgoftaar

Abstract

Copious amounts of unlabeled data continue to be generated. This data has the potential to accelerate machine learning research; however, supervised methods require labels, and unsupervised methods often require expert fine-tuning to be reliable, both of which can impose significant cost. Beyond removing the need for labels, unsupervised learning also helps protect privacy because it does not require human annotation of the data. Class imbalance, where one class has significantly more instances than another, further complicates model training and can reduce performance. Given these challenges, automated unsupervised methods offer a path forward for machine learning research. The primary objective of this dissertation is to develop a novel method for determining the class distribution of an unlabeled dataset, along with a fully automated and unsupervised class labeling framework. We validate our methods across a diverse set of real-world tabular datasets that vary widely in domain, class distribution, feature dimensionality, and size, including challenging applications such as fraud detection and cognitive assessment.

Our approach combines two labeling strategies, an unsupervised ensemble and percentile-threshold-based methods, to create a high-confidence set of labels, which ultimately determines a single positive or negative label for each instance in the dataset based on the expected number of positives. We further improve label quality and efficiency by integrating unsupervised feature selection to rank and identify the most informative features. Unsupervised feature selection simplifies the model and reduces computational complexity, making the method well-suited for large-scale, severely imbalanced datasets (e.g., Medicare and credit card fraud). Moreover, we enhance our labeling method by introducing an unsupervised framework that automatically estimates the class distribution. Using this estimate, the framework selects decision thresholds adaptively, thereby improving label quality. Because our approach relies exclusively on the dataset’s own features for labeling and requires no external labels or manual annotations, it is fully automated and unsupervised. We detail empirical results demonstrating substantial improvements in label quality, both across refinements of the method (e.g., progressing from unsupervised to automated unsupervised approaches) and in comparison to an unsupervised baseline learner. These results highlight the effectiveness of our novel class distribution estimation and class label generation methods when applied to unlabeled data.
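
As a rough illustration of labeling "based on the expected number of positives", the sketch below averages rank-normalized anomaly scores from several detectors and marks the top-ranked instances as positive. It is not the dissertation's algorithm: the function names `rank_normalize` and `label_top_k`, the averaging rule, and the assumption that higher scores mean "more likely positive" are all hypothetical choices made for this example, and the positive rate is supplied by hand rather than estimated.

```python
from typing import List, Sequence


def rank_normalize(scores: Sequence[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank; tied scores share their average rank."""
    n = len(scores)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of values tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / (n - 1)
        i = j + 1
    return ranks


def label_top_k(detector_scores: Sequence[Sequence[float]],
                positive_rate: float) -> List[int]:
    """Label the k highest-scoring instances 1 and the rest 0.

    detector_scores: one score list per (hypothetical) detector, all of equal
    length, where a higher score is assumed to mean "more likely positive".
    positive_rate: assumed fraction of positives, so k = round(rate * n).
    """
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    n = len(detector_scores[0])
    normalized = [rank_normalize(s) for s in detector_scores]
    # Simple ensemble: average the rank-normalized scores per instance.
    combined = [sum(column) / len(normalized) for column in zip(*normalized)]
    k = max(1, round(positive_rate * n))
    top = sorted(range(n), key=lambda i: combined[i], reverse=True)[:k]
    labels = [0] * n
    for i in top:
        labels[i] = 1
    return labels


if __name__ == "__main__":
    scores = [
        [0.10, 0.20, 0.90, 0.30, 0.15],  # hypothetical detector A
        [1.0, 2.0, 10.0, 3.0, 1.5],      # hypothetical detector B
    ]
    print(label_top_k(scores, positive_rate=0.2))
```

Rank normalization is used here only so that detectors with different score scales contribute equally to the average; any other combination rule could be swapped in.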
