FishDetector-R1: Unified MLLM-Based Framework with Reinforcement Fine-Tuning for Weakly Supervised Fish Detection, Segmentation, and Counting

Available in December 2025

*Equal Contribution    ¹University of Michigan

We present FishDetector-R1, a unified MLLM-based framework with reinforcement fine-tuning for weakly supervised fish detection, segmentation, and counting.

Abstract

Analyzing underwater fish imagery is critical for ecological monitoring but remains difficult due to visual degradation and costly annotations. We introduce FishDetector-R1, a unified MLLM-based framework for fish detection, segmentation, and counting under weak supervision. On the DeepFish dataset, our framework achieves substantial gains over baselines, improving AP by 20% and mIoU by 10%, while reducing MAE by 30% and GAME by 35%. These improvements stem from two key components: a novel detect-to-count prompt that enforces spatially consistent detections and counts, and Reinforcement Learning from Verifiable Reward (RLVR), paired with a complementary, scalable training paradigm that leverages sparse point labels. Ablation studies further validate the effectiveness of this reward design. Moreover, the improvements generalize well to other underwater datasets, confirming strong cross-domain robustness. Overall, FishDetector-R1 provides a reliable and scalable solution for accurate marine visual understanding via weak supervision.
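For readers unfamiliar with the counting metrics named above, the following is a minimal sketch of MAE and GAME (Grid Average Mean Absolute Error) computed from point locations. GAME at level L splits the image into 4^L non-overlapping cells and sums the per-cell absolute count errors, so GAME at level 0 reduces to the plain absolute count error. The function names, the point-based input format, and the choice of level are illustrative assumptions, not the evaluation code used for the reported numbers.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def grid_counts(points: Sequence[Point], width: int, height: int, level: int) -> List[int]:
    """Count points falling in each cell of a 2^level x 2^level grid."""
    n = 2 ** level
    counts = [0] * (n * n)
    for x, y in points:
        # Clamp so points on the right/bottom border land in the last cell.
        col = min(int(x * n / width), n - 1)
        row = min(int(y * n / height), n - 1)
        counts[row * n + col] += 1
    return counts


def game(pred: Sequence[Point], gt: Sequence[Point],
         width: int, height: int, level: int) -> float:
    """Per-image GAME(level): sum of absolute count errors over all grid cells."""
    p = grid_counts(pred, width, height, level)
    g = grid_counts(gt, width, height, level)
    return float(sum(abs(a - b) for a, b in zip(p, g)))


def mae(pred_counts: Sequence[int], gt_counts: Sequence[int]) -> float:
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / len(gt_counts)
```

A prediction with the right total count but the wrong locations scores 0 at level 0 and a positive value at higher levels, which is why GAME complements MAE when localization matters. Dataset-level GAME is the mean of the per-image values.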

Overview

We propose FishDetector-R1, the first unified framework to integrate an MLLM with a segmentation foundation model for comprehensive marine fish analysis (detection, segmentation, and counting) using only weak, point-level supervision.

We design a novel joint detect-to-count learning paradigm to adapt foundation models to the challenging underwater domain in a complementary manner. By formulating sparse point labels as verifiable rewards within an RLVR framework, our method enforces spatial and numerical consistency, enabling the generation of high-quality masks from minimal annotation.
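To make the idea of a point-based verifiable reward concrete, here is a minimal sketch of one way such a reward could be computed from predicted boxes and ground-truth points. It is an illustration under stated assumptions, not the reward used in FishDetector-R1: the greedy matching rule, the F1-based spatial term, the count term, and the weights `w_spatial` and `w_count` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Point = Tuple[float, float]              # (x, y)


def contains(box: Box, pt: Point) -> bool:
    x1, y1, x2, y2 = box
    x, y = pt
    return x1 <= x <= x2 and y1 <= y <= y2


def match_points_to_boxes(boxes: Sequence[Box], points: Sequence[Point]) -> int:
    """Greedy one-to-one matching (hypothetical rule, not optimal assignment):
    each point claims the smallest still-unused box that contains it."""
    used = set()
    hits = 0
    for pt in points:
        candidates: List[Tuple[float, int]] = [
            ((b[2] - b[0]) * (b[3] - b[1]), i)
            for i, b in enumerate(boxes)
            if i not in used and contains(b, pt)
        ]
        if candidates:
            _, idx = min(candidates)
            used.add(idx)
            hits += 1
    return hits


def point_reward(boxes: Sequence[Box], points: Sequence[Point],
                 w_spatial: float = 0.5, w_count: float = 0.5) -> float:
    """Assumed reward in [0, 1]: weighted sum of a spatial F1 term and a count term."""
    if not boxes and not points:
        return 1.0  # correctly predicting an empty scene
    hits = match_points_to_boxes(boxes, points)
    precision = hits / len(boxes) if boxes else 0.0
    recall = hits / len(points) if points else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    # Count term decays linearly with the relative counting error.
    count_term = 1.0 - min(1.0, abs(len(boxes) - len(points)) / max(len(points), 1))
    return w_spatial * f1 + w_count * count_term
```

Because the reward depends only on point labels and the model's own boxes, it can be checked automatically for every sampled response, which is the property that makes a reward "verifiable" in an RLVR setting.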

We conduct extensive quantitative and qualitative experiments on the DeepFish dataset to demonstrate the effectiveness of the proposed FishDetector-R1 pipeline, which achieves performance competitive with, and even surpassing, fully supervised methods. We further validate its strong generalization through zero-shot transfer to the SUIM underwater dataset.

FishDetector-R1 Overview

Qualitative Results on DeepFish Dataset

The proposed FishDetector-R1 framework achieves state-of-the-art performance on the DeepFish dataset, demonstrating its effectiveness in handling complex underwater environments.

Prompt Qualitative Results

Qualitative Comparison between Qwen2.5-VL and FishDetector-R1. On a challenging scene from DeepFish FishLoc, our detect-to-count strategy enables more accurate localization and structured outputs.
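As an illustration of what a structured detect-to-count output might look like downstream, the sketch below parses a model response and checks that the reported count agrees with the number of boxes. The JSON schema (a `detections` list of `[x1, y1, x2, y2]` boxes plus an integer `count`) and the function name are assumptions made for this example; the actual output format of FishDetector-R1 may differ.

```python
import json
from typing import Any, Dict, Optional


def parse_detect_to_count(text: str) -> Optional[Dict[str, Any]]:
    """Extract a hypothetical {"detections": [...], "count": N} object from model text.

    Returns None if no well-formed object is found; otherwise returns the boxes,
    the reported count, and whether the two agree.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    boxes = data.get("detections")
    count = data.get("count")
    if not isinstance(boxes, list) or not isinstance(count, int):
        return None
    for box in boxes:
        # Each box must be four numbers: x1, y1, x2, y2.
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, (int, float)) for v in box)):
            return None

    return {"boxes": boxes, "count": count, "consistent": count == len(boxes)}
```

A response whose count disagrees with its box list can then be flagged or penalized, which is one simple way to enforce agreement between detections and counts.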

Segmentation Quality

Detection and segmentation results of FishDetector-R1 across diverse underwater habitats. The model demonstrates robustness to variations in background complexity, lighting conditions, and water color, highlighting its applicability to real-world marine environments.

Supplementary Figure 2

Qualitative detection results on the DeepFish FishLoc test set. In crowded scenes with many small fish, FishDetector-R1 delivers more accurate counts and tighter localizations. Best viewed in color and zoomed in.

Zero-shot Transfer on SUIM Dataset

Qualitative results on the SUIM dataset. (a) Our model generalizes to the SUIM domain and successfully identifies fish species that share visual similarities with those in DeepFish. (b) Visualization of additional fish and vertebrate categories present in SUIM but absent in DeepFish, demonstrating the model's cross-domain adaptability and retention of original MLLM-level generalizability.

[Figure: panels (a) and (b)]

BibTeX