GrapeUL-YOLO: bidirectional cross-scale fusion with elliptical anchors for robust grape detection in orchards.
Xiuli Zhu, Zhenghong Yu, Chengwei Li
Abstract
Accurate grape detection in orchards is a key step toward automated harvesting. Orchard environments pose three main challenges: complex backgrounds, variable lighting conditions, and dense occlusion of fruits. To address them, this study proposes Grapevine Ultra-Lightweight YOLO (GrapeUL-YOLO), a highly robust real-time grape detection model for orchard scenarios. Built on YOLOv11, the model enhances detection performance through three designs. First, it adopts a Cross-Scale Residual Feature Backbone (CSRB) as the feature extraction network, combining a 16× downsampling operation with modules such as C3k2_SP and SPPELAN; this reduces computational complexity while retaining multi-scale features of grapes, from small clusters to entire clusters. Second, it constructs an Adaptive Bidirectional Fusion Network (ABFN) in the detection Neck; CARAFE content-aware upsampling and a bidirectional cross-scale concatenation mechanism strengthen the interaction between spatial details and semantic information, improving feature fusion in scenes with dense occlusion. Third, it designs a shape-adaptive detection Head that uses customized elliptical anchor boxes to match the natural shape of grapes and detects grape targets of different sizes according to a scale division. On the Embrapa WGISD dataset, GrapeUL-YOLO reaches an mAP@0.5 of 0.912 and an mAP@0.5:0.95 of 0.576, both outperforming nine mainstream models, including CenterNet and YOLOv11. The model has only 5.11M parameters and an average detection time of 16.9 ms per image, balancing high precision with a lightweight design and providing an efficient solution for automated grape detection and harvesting in orchards.
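To make the idea of elliptical anchors concrete, the sketch below scores the overlap between two axis-aligned ellipses by sampling a grid and counting cells. It is only an illustration under stated assumptions: the abstract does not say how GrapeUL-YOLO parameterizes or matches its elliptical anchors, so the `(cx, cy, a, b)` layout, the function name `ellipse_iou`, and the grid-sampling approach are all hypothetical choices and not the authors' method.

```python
import math


def ellipse_iou(e1, e2, step=0.5):
    """Approximate IoU of two axis-aligned ellipses by grid sampling.

    Each ellipse is a tuple (cx, cy, a, b): centre plus horizontal and
    vertical semi-axes. This layout is an assumption for illustration.
    """

    def inside(e, x, y):
        cx, cy, a, b = e
        return ((x - cx) / a) ** 2 + ((y - cy) / b) ** 2 <= 1.0

    # Bounding region that covers both ellipses.
    x_min = min(e1[0] - e1[2], e2[0] - e2[2])
    x_max = max(e1[0] + e1[2], e2[0] + e2[2])
    y_min = min(e1[1] - e1[3], e2[1] - e2[3])
    y_max = max(e1[1] + e1[3], e2[1] + e2[3])

    nx = int(math.ceil((x_max - x_min) / step))
    ny = int(math.ceil((y_max - y_min) / step))

    inter = 0
    union = 0
    for i in range(nx):
        x = x_min + (i + 0.5) * step  # sample at cell centres
        for j in range(ny):
            y = y_min + (j + 0.5) * step
            in1 = inside(e1, x, y)
            in2 = inside(e2, x, y)
            if in1 and in2:
                inter += 1
            if in1 or in2:
                union += 1
    return inter / union if union else 0.0


# A tall elliptical anchor (hypothetical size) against a smaller concentric target.
anchor = (0.0, 0.0, 10.0, 20.0)
target = (0.0, 0.0, 5.0, 10.0)
print(round(ellipse_iou(anchor, target), 2))
```

Because the target here lies entirely inside the anchor, the score is roughly the ratio of the two ellipse areas. A smaller `step` gives a finer approximation at higher cost; a practical detector would more likely use a closed-form or box-based surrogate.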