Author Type

Graduate Student

Date of Award

Spring 4-23-2026

Document Type

Thesis

Publication Status

Version of Record

Submission Date

May 2026

Department

Computer and Electrical Engineering and Computer Science

College Granting Degree

College of Engineering and Computer Science

Department Granting Degree

Electrical Engineering and Computer Science

Degree Name

Master of Science (MS)

Thesis/Dissertation Advisor [Chair]

Xingquan (Hill) Zhu

Abstract

The first step in biomedical NLP is clinical named entity recognition, which consists of identifying and categorizing clinical entities such as diseases, symptoms, genetics, diagnostic tests, and procedures in unstructured clinical text. This study presents a Retrieval-Augmented Generation (RAG) framework based on PubMed and UMLS that improves the ability of Large Language Models (LLMs) to identify clinical entities by supplying retrieved context. The framework is a two-stage pipeline: an initial LLM-based classification identifies candidate tokens, which are then refined using context retrieved from either PubMed or UMLS. The framework is evaluated on two established biomedical datasets, the NCBI Disease Corpus (binary classification) and MedMentions (multi-class classification), using three LLMs: LLaMA-70B, Qwen-35B, and GPT-5. The evaluation shows that retrieval from PubMed consistently improved or maintained F1 scores. The results therefore indicate that retrieval-source selection is a critical aspect of retrieval-augmented generation systems for biomedical NLP.
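The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify prompts, retrieval calls, or how refinement is performed, so everything here is an assumption: `two_stage_ner`, `toy_classify`, and `toy_retrieve` are hypothetical names, the classifier and retriever are toy stand-ins for an LLM and a PubMed or UMLS lookup, and re-classifying each candidate with its retrieved context is only one plausible reading of "refined".

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the thesis):
#   Classifier: (token, context) -> label, standing in for an LLM call
#   Retriever:  token -> retrieved text, standing in for PubMed/UMLS lookup
Classifier = Callable[[str, str], str]
Retriever = Callable[[str], str]


def two_stage_ner(
    tokens: List[str],
    classify: Classifier,
    retrieve: Retriever,
    negative_label: str = "O",
) -> List[Tuple[str, str]]:
    """Classify every token once, then re-classify only the candidates
    (tokens given a non-negative label) with retrieved context."""
    results = []
    for token in tokens:
        # Stage 1: initial classification with no retrieved context.
        label = classify(token, "")
        if label != negative_label:
            # Stage 2: refine the candidate using retrieved context.
            context = retrieve(token)
            label = classify(token, context)
        results.append((token, label))
    return results


# Toy stand-ins so the sketch runs without any external service.
def toy_classify(token: str, context: str) -> str:
    if not context:
        # Over-eager first pass: flags anything that might be a disease.
        return "Disease" if token.lower() in {"diabetes", "cold"} else "O"
    # Second pass: keep the label only if the context supports it.
    return "Disease" if "disease" in context.lower() else "O"


def toy_retrieve(token: str) -> str:
    snippets = {
        "diabetes": "Diabetes is a chronic metabolic disease.",
        "cold": "Cold describes a low temperature.",
    }
    return snippets.get(token.lower(), "")


if __name__ == "__main__":
    sentence = ["patient", "has", "diabetes", "and", "cold", "hands"]
    for token, label in two_stage_ner(sentence, toy_classify, toy_retrieve):
        print(token, label)
```

In this toy run the first pass flags both "diabetes" and "cold" as candidates, and the second pass keeps only "diabetes" because its retrieved snippet supports the disease label. Passing a different `retrieve` function is how the retrieval source would be swapped under this sketch.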
