Semester Award Granted
Spring 2025
Submission Date
May 2025
Document Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Thesis/Dissertation Advisor [Chair]
Xingquan Zhu
Abstract
Biomedical data analytics is a broad interdisciplinary field in which recent technological advances have produced vast amounts of data. Time- and cost-efficient computational methods are required to characterize, summarize, and interpret these datasets in order to advance public health. This thesis presents studies of predictive and generative models in subdomains of clinical research, healthcare, and bioinformatics. Our research on predictive modeling presents (1) extensive feature engineering methods, ensemble models, and feature ranking to provide data-driven models that evaluate relevant evidence for clinical decision making; and (2) network analysis and unsupervised link prediction methods for clinical trial recommendation. For generative modeling, we present the use of a generative large language model (LLM) to predict novel mutations. Together, our studies present a variety of methods to improve common classification algorithms on biomedical data, as well as novel methods for representing, classifying, and generating biomedical data.
A summary of the contributions of this thesis is as follows:
• We use machine learning methods to investigate prematurely ended clinical trials and address two fundamental questions: (1) what common factors/markers are associated with terminated clinical trials? and (2) how can we accurately predict whether a clinical trial will be terminated? To address the first question, we introduce extensive feature engineering and feature ranking methods. For the second question, we train multiple machine learning models with random undersampling and ensemble methods to handle the class imbalance problem. The resulting models give satisfactory prediction results and can directly estimate a clinical trial's chance of success, which can help minimize costs. We conduct these studies first on a global (non-disease-specific) clinical trial dataset, and then conduct an ablation study restricted to COVID-19 clinical trials. Results from the global clinical trial study give insights into the research areas and general trial attributes associated with a higher risk of ending prematurely. Results from the COVID-19 ablation study give insights into specific interventions associated with a higher risk of trials ending prematurely.
• We present COVID-19 diagnostic models that predict positive test results from basic symptom and demographic information. We investigate which types of symptoms are most informative for COVID-19 prediction and show that symptom-based diagnostic models can predict COVID-19 positive test results. We also investigate the correlation between different COVID-19 diagnostic tests. Outcomes of this study provide information on the relationships among COVID-19 diagnostic tests and show that symptom data can be used to predict COVID-19 test results.
• We introduce network analytics of infectious disease clinical trials in order to characterize and understand infectious disease research. We model infectious disease clinical trials as a bipartite network of sponsors and research areas. The network allows us to analyze trends in infectious disease research and to build a recommendation system for clinical research information. Because the network can be sparse, we identify sponsor communities and research area topics. We present a new link prediction method that utilizes these communities and topics, allowing us to recommend related research areas to sponsors.
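As a rough illustration of the last contribution, the sketch below shows one simple way that sponsor communities and research area topics could feed a link prediction score on a sponsor and research area bipartite network. It is not the method developed in the thesis, which this abstract does not specify. The function name `recommend_areas`, the `topic_bonus` weight, the scoring rule, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def recommend_areas(edges, sponsor_community, area_topic, sponsor, topic_bonus=0.5):
    """Rank research areas the sponsor is not yet linked to.

    Hypothetical scoring rule (an assumption, not the thesis's method):
      +1 for every other sponsor in the same community linked to the area,
      +topic_bonus if the area shares a topic with one of the sponsor's areas.
    """
    # Build the sponsor -> research areas adjacency of the bipartite network.
    areas_of = defaultdict(set)
    for s, a in edges:
        areas_of[s].add(a)

    known = areas_of.get(sponsor, set())
    known_topics = {area_topic[a] for a in known if a in area_topic}
    community = sponsor_community.get(sponsor)

    scores = defaultdict(float)

    # Community evidence: areas studied by sponsors in the same community.
    for other, other_areas in areas_of.items():
        if other == sponsor or community is None:
            continue
        if sponsor_community.get(other) != community:
            continue
        for area in other_areas - known:
            scores[area] += 1.0

    # Topic evidence: unseen areas sharing a topic with the sponsor's known areas.
    all_areas = {a for _, a in edges}
    for area in all_areas - known:
        if area_topic.get(area) in known_topics:
            scores[area] += topic_bonus

    # Highest score first; ties broken alphabetically for a stable order.
    ranked = [(a, sc) for a, sc in scores.items() if sc > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Toy data, invented purely for illustration.
edges = [
    ("A", "influenza"), ("A", "hiv"),
    ("B", "influenza"), ("B", "covid-19"),
    ("C", "hiv"), ("C", "tuberculosis"),
    ("D", "malaria"),
]
sponsor_community = {"A": 0, "B": 0, "C": 0, "D": 1}
area_topic = {
    "influenza": "respiratory",
    "covid-19": "respiratory",
    "hiv": "chronic",
    "tuberculosis": "mycobacterial",
    "malaria": "vector-borne",
}

print(recommend_areas(edges, sponsor_community, area_topic, "A"))
```

In this toy example, sponsor A has no direct link to "covid-19", yet that area ranks first because a sponsor in the same community studies it and it shares a topic with "influenza". This is the kind of indirect evidence that can help when the observed network is sparse.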
Recommended Citation
Elkin, Magdalyn E., "PREDICTIVE AND GENERATIVE MODELING FOR BIOMEDICAL DATA ANALYTICS" (2025). Electronic Theses and Dissertations. 76.
https://digitalcommons.fau.edu/etd_general/76