Author: Parlak, Cevahir
Date: 2026-05-12 (accessioned), 2026-05-12 (available), 2026 (issued)
ISSN: 2619-8991
DOI: 10.34248/bsengineering.1736319
URI: https://hdl.handle.net/123456789/1508
URI: https://search.trdizin.gov.tr/en/yayin/detay/1382587

Abstract: This study focuses on the performance evaluation of cutting-edge object detection models, namely YOLO12X, Mask R-CNN, RT-DETR-X, and RF-DETR-Large, on the Open Images (multi-object) and LaSOT (single-object) datasets. Current state-of-the-art applications involve CNN-based and Transformer-based object detection models. CNN-based models use either one-pass (YOLO family) or two-pass (R-CNN family) designs; one-pass detectors are typically faster but less accurate than their two-pass counterparts. Transformer-based models build on Detection Transformers or Vision Transformers; they are gaining popularity, and their performance increasingly surpasses that of CNN-based models. This study evaluates YOLO12X and Mask R-CNN from the CNN-based family, and RT-DETR-X and RF-DETR-Large from the Transformer-based family, in terms of accuracy and runtime on the Open Images and LaSOT datasets. All models are the largest available variants and are pretrained on the COCO dataset. Transformer-based models incorporate specialized forms of self-attention and deliver significant improvements in both accuracy and speed. The experimental results demonstrate that attention- and Transformer-based models outperform traditional CNN-based object detectors, and that YOLO12X is the fastest method by a wide margin. On the LaSOT dataset, RT-DETR-X posts 0.8804 IoU, 0.7047 F1-score, 0.6597 mAP@0.5, and 28.64 fps, whereas YOLO12X achieves 0.8572 IoU, 0.6657 F1-score, 0.5357 mAP@0.5, and 49.78 fps.

Language: en
Rights: info:eu-repo/semantics/openAccess
Subjects: Computer Sciences, Software Engineering; Computer Sciences, Artificial Intelligence
Title: Evaluation of Cutting-Edge Object Detection Architectures on Multi-Object and Single-Object Datasets
Type: Article
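For readers unfamiliar with the IoU figures quoted in the abstract, the sketch below is a minimal, generic illustration of how Intersection over Union is computed for two axis-aligned boxes; it is not code from the paper, and the box coordinates used in the example are invented for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero: non-overlapping boxes contribute no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical boxes: unit overlap of two 2x2 boxes gives IoU = 1 / 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # → 0.14285714285714285
```

A detector's mAP@0.5, also reported above, counts a predicted box as a true positive only when its IoU with a ground-truth box exceeds 0.5.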