Author Type

Graduate Student

Date of Award

Fall 12-9-2025

Document Type

Thesis

Publication Status

Version of Record

Submission Date

December 2025

Department

Computer and Electrical Engineering and Computer Science

College Granting Degree

College of Engineering and Computer Science

Department Granting Degree

Electrical Engineering and Computer Science

Degree Name

Doctor of Philosophy (PhD)

Thesis/Dissertation Advisor [Chair]

Taghi M. Khoshgoftaar

Abstract

In today’s data-driven landscape, large volumes of data are generated continuously, often containing imperfections such as noise, missing values, or unreliable labels. These real-world datasets are typically high-dimensional, sparsely labeled, and imbalanced, creating substantial challenges for both supervised and unsupervised learning. These challenges are especially prevalent in anomaly detection, where instances of the class of interest are rare and underrepresented compared to normal instances. This dissertation proposes robust frameworks for anomaly detection that address these data challenges and improve model performance and robustness, and evaluates them on real-world datasets, including credit card transactions and cognitive assessments. Supervised learning requires labeled data, which can be costly, hard to produce, and prone to mislabeling. We propose a method based on reconstruction error to identify and correct mislabeled samples, thereby improving the quality of labeled data. To address imbalance and high dimensionality, we combine deep feature extraction using convolutional autoencoders, an unsupervised learning technique, with class rebalancing strategies to improve classification performance. We then examine how the order of preprocessing steps affects downstream ensemble learners. For unlabeled data, we propose a novel hybrid unsupervised framework, CAE-IF, that integrates convolutional autoencoders for representation learning with Isolation Forest for anomaly detection. CAE-IF demonstrates robust performance on unlabeled, high-dimensional, and imbalanced data across cognitive and fraud detection domains, relative to common baselines such as Isolation Forest and Local Outlier Factor. In addition, we apply an instance-based iterative cleaning method that uses reconstruction error to remove likely outliers and improves representation quality for downstream detection without requiring manual annotation.
The results demonstrate that our proposed approaches improve model robustness in various imperfect data conditions. Collectively, these contributions provide a practical and generalizable toolkit for anomaly detection, addressing the core challenges of class imbalance, label noise, and label scarcity across both supervised and unsupervised settings.
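
The iterative cleaning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the abstract gives no algorithmic details, so the function names, the `drop_fraction` and `rounds` parameters, and the toy data below are all assumptions made for illustration. In particular, a feature-wise mean stands in for the convolutional autoencoder as the "reconstruction" model, purely to keep the example dependency-free.

```python
from statistics import fmean


def fit_reconstructor(rows):
    """Stand-in model (assumption): reconstruct every row as the feature-wise
    mean of the retained rows. A real pipeline would train an autoencoder."""
    return [fmean(col) for col in zip(*rows)]


def reconstruction_error(row, model):
    """Mean squared difference between a row and its reconstruction."""
    return sum((x - m) ** 2 for x, m in zip(row, model)) / len(row)


def iterative_clean(rows, drop_fraction=0.1, rounds=3):
    """Return the indices of rows kept after repeatedly refitting the model
    and discarding the rows with the highest reconstruction error.

    drop_fraction and rounds are hypothetical knobs, not values from the
    dissertation."""
    kept = list(range(len(rows)))
    for _ in range(rounds):
        n_drop = int(len(kept) * drop_fraction)
        if n_drop == 0:
            break
        model = fit_reconstructor([rows[i] for i in kept])
        errors = {i: reconstruction_error(rows[i], model) for i in kept}
        ranked = sorted(kept, key=errors.get)  # lowest error first
        kept = sorted(ranked[: len(ranked) - n_drop])
    return kept


# Toy data (assumption): 18 points near the origin plus 2 obvious outliers.
rows = [[(i % 5) * 0.1, (i % 3) * 0.1] for i in range(18)]
rows += [[10.0, 10.0], [-8.0, 9.0]]

kept = iterative_clean(rows, drop_fraction=0.1, rounds=1)
print(kept)  # indices of the rows retained after one cleaning round
```

In this sketch no labels are consulted at any point, which mirrors the abstract's claim that cleaning proceeds without manual annotation. Swapping `fit_reconstructor` and `reconstruction_error` for a trained autoencoder's forward pass and loss would leave the loop itself unchanged.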
