Date of Award
Spring 4-23-2026
Document Type
Thesis
Publication Status
Version of Record
Submission Date
May 2026
Department
Computer and Electrical Engineering and Computer Science
College Granting Degree
College of Engineering and Computer Science
Department Granting Degree
Electrical Engineering and Computer Science
Degree Name
Master of Science (MS)
Thesis/Dissertation Advisor [Chair]
Xingquan (Hill) Zhu
Abstract
The first step of biomedical NLP is clinical named entity recognition, which consists of identifying and categorizing clinical entities such as diseases, symptoms, genetics, diagnostic tests, and procedures in unstructured clinical text. This study presents a PubMed- and UMLS-based retrieval-augmented generation (RAG) framework that improves the ability of large language models (LLMs) to identify clinical entities by supplying them with retrieved context. The framework is a two-stage pipeline: candidate tokens are first identified by an initial LLM-based classification and then refined using context retrieved from either PubMed or UMLS. The framework is evaluated on two established biomedical datasets, the NCBI Disease Corpus (binary classification) and MedMentions (multi-class classification), using three LLMs: LLaMA-70B, Qwen-35B, and GPT-5. The evaluation shows that PubMed-based retrieval consistently improved or maintained F1 scores. These results indicate that retrieval-source selection is a critical aspect of retrieval-augmented generation systems for biomedical NLP.
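The two-stage pipeline described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the function names (`two_stage_ner`, `toy_classify`, `toy_retrieve`, `toy_refine`), the label set, the toy knowledge base, and the refinement rule. In a real system the classifier and refiner would be LLM calls and the retriever would query PubMed or UMLS.

```python
from typing import Callable, Dict, List

# Hypothetical toy stand-in for a retrieval source such as PubMed or UMLS.
# The snippet text is placeholder data, not real retrieved content.
TOY_KB: Dict[str, List[str]] = {
    "diabetes": ["Diabetes mellitus is a chronic metabolic disease."],
}


def toy_classify(tokens: List[str]) -> List[str]:
    """Stage 1 stand-in: a deliberately over-eager classifier (assumed)."""
    flagged = {"diabetes", "fatigue"}
    return ["Disease" if t.lower() in flagged else "O" for t in tokens]


def toy_retrieve(token: str) -> List[str]:
    """Stand-in retriever: look the token up in the toy knowledge base."""
    return TOY_KB.get(token.lower(), [])


def toy_refine(token: str, label: str, context: List[str]) -> str:
    """Stage 2 stand-in: keep the label only if the context supports it.

    The support rule (a snippet mentioning 'disease') is an assumption.
    """
    supported = any("disease" in snippet.lower() for snippet in context)
    return label if supported else "O"


def two_stage_ner(
    tokens: List[str],
    classify: Callable[[List[str]], List[str]],
    retrieve: Callable[[str], List[str]],
    refine: Callable[[str, str, List[str]], str],
) -> List[str]:
    """Classify all tokens, then re-check each candidate with retrieved context."""
    initial = classify(tokens)          # Stage 1: initial classification
    final = list(initial)
    for i, (token, label) in enumerate(zip(tokens, initial)):
        if label != "O":                # only candidate tokens are refined
            context = retrieve(token)   # fetch context for this candidate
            final[i] = refine(token, label, context)  # Stage 2: refinement
    return final


if __name__ == "__main__":
    sentence = ["Patient", "has", "diabetes", "and", "fatigue"]
    print(two_stage_ner(sentence, toy_classify, toy_retrieve, toy_refine))
```

In this sketch only tokens flagged in the first stage trigger retrieval, which keeps the number of retrieval calls proportional to the number of candidates rather than to the length of the text.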
Recommended Citation
Tripathi, Apoorv, "LLM FOR CLINICAL NAMED ENTITY RECOGNITION: A STUDY ON RAG WITH PUBMED and UMLS" (2026). Electronic Theses and Dissertations. 345.
https://digitalcommons.fau.edu/etd_general/345