Grounded Explainable Agent for Reasoning Segmentation
Figure 1: Overview of GEAR-Seg's multifaceted capabilities. Serving as both a zero-shot inference agent and a scalable data engine, it explicitly translates pixels into text to seamlessly support complex reasoning segmentation, dense referring segmentation, and fine-grained attribute grounding in long-tail domains.
GEAR-Seg introduces a paradigm shift from implicit end-to-end entanglement to transparent, modular reasoning for visual segmentation tasks.
We propose GEAR-Seg, an explicitly decoupled agent that shifts the field from implicit end-to-end entanglement to transparent, modular reasoning. By systematically replacing sparse category prompts with dense, mask-level attribute descriptions via a pixel-to-text paradigm, GEAR-Seg achieves competitive zero-shot performance in reasoning segmentation and dense referring segmentation.
Leveraging GEAR-Seg as an automated generation pipeline, we construct GEAR-131K, a massive, high-fidelity reasoning segmentation dataset (over 38k images, 656k QA-mask pairs). Breaking away from traditional single-target constraints, our dataset introduces a multi-dimensional taxonomy strictly tailored for real-world manipulation-oriented reasoning and functional affordances.
We propose a low-cost, data-centric distillation paradigm. By training end-to-end networks exclusively on our automatically generated dataset, we distill the cognitive capabilities of the GEAR-Seg agent into downstream architectures. Student models ranging from large VLMs (e.g., LISA) to lightweight models (e.g., YOLOv8), supervised solely by our generated data, achieve performance close to that of counterparts trained on costly human-annotated datasets.
An explicitly decoupled agent that transforms implicit reasoning into an explicit, trackable logic chain through pixel-to-text translation.
Figure 2: Overview of the GEAR-Seg framework. The agent explicitly decouples the reasoning segmentation task into class-agnostic perception (SAM 2), dense semantic description (DAM), and logic-driven abstraction (LLM), serving as both a zero-shot inference engine and a scalable data generator.
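We employ SAM 2 in Everything Mode to propose class-agnostic masks across the whole image. Mask Non-Maximum Suppression (NMS) then filters overlapping duplicates, yielding valid instance masks and ensuring no salient object or subtle background attribute is overlooked.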
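As an illustration of the mask-filtering step, here is a minimal sketch of greedy Mask NMS over boolean masks. It is not GEAR-Seg's actual implementation: the function names, the per-mask `scores` (for example, a predicted quality score), and the `iou_thresh` value are assumptions made for this example.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mask_nms(masks, scores, iou_thresh=0.7):
    """Greedy mask NMS (illustrative sketch).

    Visits masks from highest to lowest score and keeps a mask only if its
    IoU with every already-kept mask is below `iou_thresh`.
    Returns the indices of the kept masks in visiting order.
    """
    order = np.argsort(scores)[::-1]  # highest score first
    keep = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) < iou_thresh for k in keep):
            keep.append(int(idx))
    return keep

# Toy example: `b` overlaps `a` heavily (IoU = 4/6), `c` is disjoint.
a = np.zeros((4, 4), dtype=bool); a[:2, :2] = True
b = np.zeros((4, 4), dtype=bool); b[:2, :3] = True
c = np.zeros((4, 4), dtype=bool); c[2:, 2:] = True
kept = mask_nms([a, b, c], scores=[0.9, 0.8, 0.7], iou_thresh=0.5)
# kept == [0, 2]: the lower-scoring near-duplicate `b` is suppressed.
```

The pairwise loop is quadratic in the number of masks, which is acceptable for a sketch; a batched IoU matrix would be the natural optimization when Everything Mode produces many proposals.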
The core pixel-to-text paradigm shift. Using DAM, global context and localized features are extracted and fused to generate context-aware descriptions for each mask. This fine-grained text is the interpretable bedrock for reasoning.
Bypassing entangled VLMs, we deploy a plug-and-play LLM. Given the query and descriptions, the LLM explicitly evaluates semantic entailment to output valid mask indices. This enables two modes: Reasoning Segmentation (abstract deduction) and Referring Segmentation (fine-grained grounding).
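The selection step can be pictured with the following sketch, which formats the per-mask descriptions into a prompt and parses the returned mask indices. The prompt wording, the JSON reply format, and the `llm` callable are hypothetical stand-ins chosen for illustration; the source does not specify how GEAR-Seg prompts its LLM.

```python
import json
import re
from typing import Callable, Dict, List, Set

def build_prompt(query: str, descriptions: Dict[int, str]) -> str:
    """Format mask descriptions and the query into one prompt (assumed format)."""
    lines = [f"[{i}] {text}" for i, text in sorted(descriptions.items())]
    return (
        "Each numbered line describes one image region.\n"
        + "\n".join(lines)
        + f"\nQuery: {query}\n"
        + "Reply with a JSON list of the region indices that satisfy the query."
    )

def parse_indices(reply: str, valid: Set[int]) -> List[int]:
    """Extract the first flat JSON list from the reply and keep known indices only."""
    match = re.search(r"\[[^\[\]]*\]", reply)
    if match is None:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    return [
        i for i in raw
        if isinstance(i, int) and not isinstance(i, bool) and i in valid
    ]

def select_masks(query: str, descriptions: Dict[int, str],
                 llm: Callable[[str], str]) -> List[int]:
    """Ask the (plug-and-play) LLM which described regions entail the query."""
    reply = llm(build_prompt(query, descriptions))
    return parse_indices(reply, set(descriptions))

# Usage with a stub in place of a real LLM client.
descriptions = {
    0: "a ceramic mug with a handle, filled with coffee",
    1: "a wooden table surface",
    2: "a glass bottle of water with a blue cap",
}
fake_llm = lambda prompt: "Regions that hold a drink: [0, 2, 7]"
selected = select_masks("What can I drink from?", descriptions, fake_llm)
# selected == [0, 2]; index 7 is discarded because no such region exists.
```

Because the LLM sees only text, the same function would serve both modes described above: an abstract query for reasoning segmentation or an attribute-level phrase for referring segmentation. Filtering against the known indices guards against hallucinated region numbers.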
GEAR-Seg inherently functions as a scalable data engine, automatically generating high-quality reasoning segmentation annotations through a closed-loop pipeline.
Figure 3: Overview of the GEAR-Seg data generation pipeline and operational modes. The engine fuses multi-granularity visual semantics to autonomously generate diverse annotations.
Lightweight VLM extracts global scene context and relationships
SAM 2 + DAM extract fine-grained semantic regions and attributes
Synthesizes modalities to generate queries, logic chains, and masks
Produces challenging queries with explicit logic chains
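To make the output of these four stages concrete, the sketch below shows one way a generated sample could be represented and serialized. The `ReasoningSample` class and all of its field names are hypothetical; the source does not describe the actual GEAR-131K annotation schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

@dataclass
class ReasoningSample:
    """One generated annotation (hypothetical schema, for illustration only)."""
    image_id: str
    scene_context: str              # global description from the lightweight VLM
    region_descriptions: List[str]  # one description per SAM 2 mask (via DAM)
    query: str                      # generated reasoning question
    logic_chain: List[str]          # explicit reasoning steps behind the answer
    target_regions: List[int]       # indices into region_descriptions

    def validate(self) -> None:
        """Reject samples whose targets are empty or point at missing regions."""
        if not self.target_regions:
            raise ValueError("a sample needs at least one target region")
        n = len(self.region_descriptions)
        bad = [i for i in self.target_regions if not 0 <= i < n]
        if bad:
            raise ValueError(f"target indices out of range: {bad}")

def to_jsonl_line(sample: ReasoningSample) -> str:
    """Validate a sample and serialize it as one JSON line."""
    sample.validate()
    return json.dumps(asdict(sample), ensure_ascii=False)

sample = ReasoningSample(
    image_id="example_0001",
    scene_context="a kitchen counter with dishes next to a sink",
    region_descriptions=["a metal kettle", "a sponge on the sink edge"],
    query="Which object would you use to clean the plates?",
    logic_chain=["Cleaning plates requires a scrubbing tool.",
                 "The sponge is the only scrubbing tool in view."],
    target_regions=[1],
)
line = to_jsonl_line(sample)
```

Keeping the logic chain alongside the query and target indices is what would let each annotation be audited after generation, in line with the explicit, trackable reasoning the agent is built around.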
A massive reasoning segmentation dataset comprising over 38k images and 656k diverse QA-mask pairs, introducing a multifaceted taxonomy tailored for complex real-world manipulation-oriented reasoning.
Figure 4: Detailed statistics of the GEAR-131K benchmark. (a) Image distribution across source datasets (LVIS, VOC, Mapillary, ADE20K). (b) Proportion of the five specialized reasoning categories. (c) Word cloud illustrating the semantic diversity of targeted entities. (d) Comprehensive feature comparison against existing reasoning segmentation datasets.
Evaluates deductive capabilities based on contextual visual cues.
Targets object groups based on utility or affordances, breaking single-category limitations.
Designed for complex interaction tasks requiring multi-target coordination across distinct semantic categories.
Breaks the holistic object-level boundary by localizing fine-grained, sub-instance regions.
Evaluates fine-grained perception of physical properties (material, shape, state), which is crucial for embodied agents.
Figure 5: Top: Representative examples of the 5-fold linguistic expansion in the GEAR-131K dataset. Bottom: Additional dataset visualizations demonstrating the variety of reasoning scenarios covered by the dataset.