Date of Award
Spring 5-13-2026
Document Type
Thesis
Publication Status
Version of Record
Submission Date
May 2026
Department
Computer and Electrical Engineering and Computer Science
College Granting Degree
College of Engineering and Computer Science
Department Granting Degree
Electrical Engineering and Computer Science
Degree Name
Master of Science (MS)
Thesis/Dissertation Advisor [Chair]
Hari Kalva
Abstract
This thesis investigates the performance of Video Coding for Machines (VCM) with Vision Transformer-based object detection models. While existing VCM studies and tool designs have largely been developed under CNN-based assumptions, recent advances in computer vision have shown the growing importance of transformer-based models. Motivated by this shift, this work studies whether VCM-compressed data remains suitable for Vision Transformer-based inference in addition to conventional CNN-based task networks.
To address this problem, three representative transformer-based object detection models were selected: DETR, SWIN, and YOLOS. These models were chosen to represent different architectural styles, namely a CNN backbone with a transformer detector, a hierarchical transformer backbone with a conventional detection framework, and a pure Vision Transformer-based detector. A VCM-compressed dataset was generated from SFU video sequences under six configurations: Random Access Inner, Random Access E2E, Low Delay Inner, Low Delay E2E, All Intra Inner, and All Intra E2E. Object detection performance was evaluated using mean Average Precision (mAP), along with absolute and relative performance differences with respect to uncompressed data.
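As a minimal sketch of the absolute and relative performance differences mentioned above: the abstract does not give the exact formulas, so the sign convention (compressed minus uncompressed), the percentage scaling, and the function name `map_differences` are assumptions made for illustration. The input values in the usage lines are placeholders, not results from the thesis.

```python
def map_differences(map_uncompressed: float, map_compressed: float) -> tuple[float, float]:
    """Return (absolute, relative) mAP change versus the uncompressed baseline.

    Assumed convention (not stated in the abstract):
      absolute = mAP_compressed - mAP_uncompressed
      relative = 100 * absolute / mAP_uncompressed   (in percent)
    Negative values therefore mean compression reduced detection accuracy.
    """
    if map_uncompressed <= 0:
        raise ValueError("baseline mAP must be positive to compute a relative difference")
    absolute = map_compressed - map_uncompressed
    relative = 100.0 * absolute / map_uncompressed
    return absolute, relative


# Placeholder numbers for illustration only.
abs_diff, rel_diff = map_differences(map_uncompressed=0.50, map_compressed=0.40)
print(f"absolute: {abs_diff:+.3f} mAP, relative: {rel_diff:+.1f}%")
```

Under this convention, a drop from 0.50 to 0.40 mAP is an absolute difference of about -0.10 and a relative difference of about -20%. Reporting both is useful because the same absolute drop matters more for a model whose uncompressed baseline is already low.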
The results show that VCM-compressed data remains suitable for transformer-based object detection models. Across the evaluated coding configurations, detection performance generally improved as bitrate increased, indicating that the information preserved by the VCM pipeline remains useful for transformer-based inference. Among the evaluated transformer-based models, SWIN showed the strongest and most consistent overall performance across most sequences and configurations. In addition to the main comparative analysis, this thesis carried out an exploratory sequence-level study using object count, object size, and frame variance. These parameters were useful for describing content characteristics, although none of them alone showed a strict or universal relationship with detection performance. However, they may still be valuable for future analysis.
Overall, this thesis extends the study of VCM beyond CNN-based task networks by evaluating multiple Vision Transformer-based object detection models and demonstrating that VCM is adaptable to transformer-based architectures. The findings support the broader relevance of VCM in modern machine-oriented video processing, where transformer-based vision models are becoming increasingly important.
Recommended Citation
Dhulipudi, Vaishnavi, "PERFORMANCE ANALYSIS OF VIDEO CODING FOR MACHINES WITH VISION TRANSFORMERS" (2026). Electronic Theses and Dissertations. 301.
https://digitalcommons.fau.edu/etd_general/301