Comparative Analysis of YOLO, Faster R-CNN, RetinaNet, and DETR for Autonomous Vehicle Object Detection
Abstract
Object detection is a cornerstone of autonomous vehicle (AV) perception, enabling
the identification of vehicles, pedestrians, traffic signs, and other road objects in real
time. This paper presents a comprehensive literature review comparing four leading object
detection architectures—YOLO (You Only Look Once), Faster R-CNN (with Feature
Pyramid Networks), RetinaNet (with Bidirectional FPN enhancements), and the Detection
Transformer (DETR)—with a focus on their application in autonomous driving systems.
We examine their architectures, training methodologies, inference speeds, detection accuracy,
and suitability for deployment under stringent AV constraints. Challenges such as
multi-scale detection, occlusions, class imbalance, and adverse environmental conditions
are analyzed using results from domain-specific benchmarks (KITTI, BDD100K, nuScenes).
Our findings indicate that one-stage detectors (YOLO, RetinaNet) generally achieve higher
frame rates suitable for on-board inference, while two-stage detectors (Faster R-CNN) often
offer superior accuracy at the cost of speed. The transformer-based DETR introduces a
new paradigm with fewer hand-designed heuristics (such as anchor boxes and non-maximum
suppression) and a streamlined pipeline, though it requires targeted improvements for
small-object detection and training efficiency. We conclude with future
research directions, including convergence between convolutional and transformer architectures,
multi-modal sensor fusion, and efficiency optimizations to meet AV safety and latency
requirements.