Abstract:
Detecting objects in video, especially small, rapidly moving, or partially occluded ones, remains highly vulnerable to false alarms. Such erroneous detections directly compromise the trustworthiness of UAV surveillance, automated inspection systems, and other safety-sensitive perception pipelines. In this work, we introduce a multi-stage refinement architecture designed to suppress false positives by exploiting complementary cues: spatial context, temporal coherence, anchor adaptation, non-maximum suppression behavior, and multi-scale feature enhancement. Starting from a Faster R-CNN backbone with an FPN, we augment the detector with a Graph Attention Contextual R-CNN to model relations between proposals, a bidirectional LSTM module for temporal feature aggregation, a K-means++-driven dynamic anchor generator, a differentiable Soft-NMS layer for score modulation, and a Multi-Level Dual Attention Refinement (MLDAR) block that operates across feature pyramid levels. Each component is grounded in a clear design motivation, accompanied by a mathematical formulation, and evaluated empirically on the Drone-vs-Bird (DVB) benchmark. The resulting system achieves notable gains in mAP, precision, and recall, together with a reduced false-positive rate, compared with YOLOv8s, Faster R-CNN, and RetinaNet, with especially strong improvements for small airborne targets.