Author Type

Graduate Student

Date of Award

Spring 5-13-2026

Document Type

Thesis

Publication Status

Version of Record

Submission Date

May 2026

Department

Computer and Electrical Engineering and Computer Science

College Granting Degree

College of Engineering and Computer Science

Department Granting Degree

Electrical Engineering and Computer Science

Degree Name

Master of Science (MS)

Thesis/Dissertation Advisor [Chair]

Hari Kalva

Abstract

This thesis investigates the performance of Video Coding for Machines (VCM) with Vision Transformer-based object detection models. Existing VCM studies and tool designs have largely been developed under CNN-based assumptions, while recent advances in computer vision have made transformer-based models increasingly important. Motivated by this shift, this work studies whether VCM-compressed data remains suitable for Vision Transformer-based inference, in addition to conventional CNN-based task networks.

To address this problem, three representative transformer-based object detection models were selected: DETR, SWIN, and YOLOS. These models represent three architectural styles, respectively: a CNN backbone with a transformer detector, a hierarchical transformer backbone with a conventional detection framework, and a pure Vision Transformer-based detector. A VCM-compressed dataset was generated from SFU video sequences under six configurations: Random Access Inner, Random Access E2E, Low Delay Inner, Low Delay E2E, All Intra Inner, and All Intra E2E. Object detection performance was evaluated using mean Average Precision (mAP), along with the absolute and relative differences in performance with respect to uncompressed data.
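The absolute and relative differences mentioned above can be illustrated with a minimal sketch. It assumes the common convention that the absolute difference is the compressed mAP minus the uncompressed mAP, and that the relative difference is that value expressed as a percentage of the uncompressed mAP; the abstract does not state the exact formulas. The function name `map_differences` and the sample values are hypothetical and are not results from the thesis.

```python
def map_differences(map_uncompressed, map_compressed):
    """Return (absolute, relative_percent) mAP difference vs. the uncompressed baseline.

    Assumed convention (not confirmed by the abstract):
      absolute = compressed - uncompressed
      relative = 100 * absolute / uncompressed
    """
    if map_uncompressed <= 0:
        raise ValueError("baseline mAP must be positive")
    absolute = map_compressed - map_uncompressed
    relative = 100.0 * absolute / map_uncompressed
    return absolute, relative


# Hypothetical values for illustration only (not results from the thesis).
abs_diff, rel_diff = map_differences(0.50, 0.45)
print(f"absolute: {abs_diff:+.3f}, relative: {rel_diff:+.1f}%")
```

Under this convention, a negative value means the detector performs worse on VCM-compressed input than on uncompressed input.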

The results show that VCM-compressed data remains suitable for transformer-based object detection models. Across the evaluated coding configurations, detection performance generally improved as bitrate increased, indicating that the information preserved by the VCM pipeline remains useful for transformer-based inference. Among the evaluated transformer-based models, SWIN showed the strongest and most consistent overall performance across most sequences and configurations. In addition to the main comparative analysis, this thesis includes an exploratory sequence-level study using object count, object size, and frame variance. These parameters were useful for describing content characteristics, although none of them alone showed a strict or universal relationship with detection performance. However, they may still be valuable for future analysis.
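As a rough illustration of the three sequence-level parameters, the sketch below computes them for a short sequence. It rests on several assumptions that the abstract does not confirm: object count is averaged per frame, object size is the bounding-box area relative to the frame area, and frame variance is the pixel-intensity variance within each grayscale frame, averaged over the sequence. The function name `sequence_descriptors` and the input layout are hypothetical.

```python
from statistics import mean, pvariance


def sequence_descriptors(frames, boxes_per_frame):
    """Summarise a sequence with three content descriptors (assumed definitions).

    frames: list of 2-D lists of grayscale pixel values, one per frame.
    boxes_per_frame: list (same length) of lists of (x, y, w, h) boxes.
    """
    if not frames or len(frames) != len(boxes_per_frame):
        raise ValueError("need one box list per frame and at least one frame")

    counts, rel_areas, variances = [], [], []
    for frame, boxes in zip(frames, boxes_per_frame):
        height, width = len(frame), len(frame[0])
        pixels = [p for row in frame for p in row]
        variances.append(pvariance(pixels))      # within-frame intensity variance
        counts.append(len(boxes))                # objects in this frame
        rel_areas.extend((w * h) / (width * height) for _, _, w, h in boxes)

    return {
        "mean_object_count": mean(counts),
        "mean_relative_object_area": mean(rel_areas) if rel_areas else 0.0,
        "mean_frame_variance": mean(variances),
    }


# Tiny synthetic example (two 2x2 frames), for illustration only.
frames = [[[0, 0], [10, 10]], [[5, 5], [5, 5]]]
boxes = [[(0, 0, 1, 1)], [(0, 0, 2, 1), (0, 0, 1, 1)]]
print(sequence_descriptors(frames, boxes))
```

Descriptors like these could then be compared with per-sequence mAP, although, as noted above, no single parameter showed a strict relationship with detection performance.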

Overall, this thesis extends the study of VCM beyond CNN-based task networks by evaluating multiple Vision Transformer-based object detection models and demonstrating that VCM is adaptable to transformer-based architectures. The findings support the broader relevance of VCM in modern machine-oriented video processing, where transformer-based vision models are becoming increasingly important.
