We propose FishDetector-R1, the first unified framework to integrate a multimodal large language model (MLLM) with a segmentation foundation model for comprehensive marine fish analysis (detection, segmentation, and counting) using only weak, point-level supervision.
We design a novel joint detect-to-count learning paradigm that adapts foundation models to the challenging underwater domain in a complementary manner. By formulating sparse point labels as verifiable rewards within a reinforcement learning with verifiable rewards (RLVR) framework, our method enforces spatial and numerical consistency, enabling the generation of high-quality masks from minimal annotation.
We conduct extensive quantitative and qualitative experiments on the DeepFish dataset to demonstrate the effectiveness of the proposed FishDetector-R1 pipeline, which performs competitively with, and even surpasses, fully supervised methods. We further validate its generalization through zero-shot transfer to a second underwater dataset, SUIM.
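To make the idea of point labels acting as verifiable rewards more concrete, the following is a minimal illustrative sketch, not the reward used in FishDetector-R1. It assumes the model emits axis-aligned boxes and that each fish carries one annotated point. The function name `point_reward`, the equal weights, the exact-match count term, and the greedy matching are all hypothetical choices made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format
Point = Tuple[float, float]              # (x, y) point label, one per fish


def point_in_box(p: Point, b: Box) -> bool:
    """Return True if point p lies inside (or on the border of) box b."""
    x, y = p
    x0, y0, x1, y1 = b
    return x0 <= x <= x1 and y0 <= y <= y1


def point_reward(boxes: List[Box], points: List[Point],
                 w_count: float = 0.5, w_loc: float = 0.5) -> float:
    """Hypothetical reward combining numerical and spatial consistency.

    count term: 1.0 if the number of predicted boxes equals the number of
                point labels, else 0.0.
    loc term:   fraction of point/box pairs matched one-to-one, using a
                simple greedy assignment (an assumption; an optimal
                assignment could be substituted).
    """
    count_term = 1.0 if len(boxes) == len(points) else 0.0

    unused = list(range(len(boxes)))  # indices of boxes not yet matched
    hits = 0
    for p in points:
        for i in unused:
            if point_in_box(p, boxes[i]):
                unused.remove(i)  # each box may explain only one point
                hits += 1
                break

    # Normalize by the larger set so extra boxes and missed points both cost.
    denom = max(len(points), len(boxes))
    loc_term = hits / denom if denom else 1.0  # empty scene, empty prediction

    return w_count * count_term + w_loc * loc_term


if __name__ == "__main__":
    labels = [(5.0, 5.0), (25.0, 25.0)]
    good = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
    missed = [(0.0, 0.0, 10.0, 10.0)]
    print(point_reward(good, labels))    # both fish found, counts agree
    print(point_reward(missed, labels))  # one fish missed, counts disagree
```

Because such a reward can be computed from the point labels alone, it can be checked automatically for every sampled output, which is the property that makes a reward "verifiable" in the RLVR sense.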
The proposed FishDetector-R1 framework achieves state-of-the-art performance on the DeepFish dataset, demonstrating its effectiveness in handling complex underwater environments.
Qualitative comparison between Qwen2.5-VL and FishDetector-R1. On a challenging scene from DeepFish FishLoc, our detect-to-count strategy enables more accurate localization and structured outputs.
Detection and segmentation results of FishDetector-R1 across diverse underwater habitats. The model demonstrates robustness to variations in background complexity, lighting conditions, and water color, highlighting its applicability to real-world marine environments.
Qualitative detection results on the DeepFish FishLoc test set. In crowded scenes with many small fish, FishDetector-R1 delivers more accurate counts and tighter localizations. Best viewed in color and zoomed in.
Qualitative results on the SUIM dataset. (a) Our model generalizes to the SUIM domain and successfully identifies fish species that share visual similarities with those in DeepFish. (b) Visualization of additional fish and vertebrate categories present in SUIM but absent in DeepFish, demonstrating the model's cross-domain adaptability and retention of original MLLM-level generalizability.