GEAR-Seg: A Grounded Explainable Agent for Reasoning Segmentation and Data Engine

Grounded Explainable Agent for Reasoning Segmentation

College of Biosystems Engineering and Food Science, Zhejiang University, Hangzhou 310058, China
ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University, Hangzhou 311215, China

GEAR-Seg Overview

Figure 1: Overview of GEAR-Seg's multifaceted capabilities. Serving as both a zero-shot inference agent and a scalable data engine, it explicitly translates pixels into text to seamlessly support complex reasoning segmentation, dense referring segmentation, and fine-grained attribute grounding in long-tail domains.

Contributions

GEAR-Seg introduces a paradigm shift from implicit end-to-end entanglement to transparent, modular reasoning for visual segmentation tasks.

1. A Modular Agent with Pixel-to-Text Translation

We propose GEAR-Seg, an explicitly decoupled agent for reasoning segmentation. Through a pixel-to-text paradigm, it replaces sparse category prompts with dense, mask-level attribute descriptions, and it achieves competitive zero-shot performance on both reasoning segmentation and dense referring segmentation.

2. A Scalable Data Engine and Comprehensive Benchmark

Leveraging GEAR-Seg as an automated generation pipeline, we construct GEAR-131K, a massive, high-fidelity reasoning segmentation dataset (over 38k images, 656k QA-mask pairs). Breaking away from traditional single-target constraints, our dataset introduces a multi-dimensional taxonomy strictly tailored for real-world manipulation-oriented reasoning and functional affordances.

3. Effective Knowledge Distillation Across Model Scales

We propose a low-cost, data-centric distillation paradigm. By training end-to-end networks exclusively on our automatically generated dataset, we distill the reasoning capabilities of the GEAR-Seg agent into downstream architectures. Student models supervised solely by our generated data, from large VLMs (e.g., LISA) to lightweight models (e.g., YOLOv8), achieve highly competitive performance, closely matching the upper bound set by counterparts trained on costly human-annotated datasets.

The GEAR-Seg Framework

An explicitly decoupled agent that transforms implicit reasoning into an explicit, trackable logic chain through pixel-to-text translation.

GEAR-Seg Framework

Figure 2: Overview of the GEAR-Seg framework. The agent explicitly decouples the reasoning segmentation task into class-agnostic perception (SAM 2), dense semantic description (DAM), and logic-driven abstraction (LLM), serving as both a zero-shot inference engine and a scalable data generator.

1. Class-Agnostic Segmentation

We employ SAM 2 in Everything Mode so that no salient object or subtle background attribute is overlooked. Mask Non-Maximum Suppression (NMS) is then applied to the resulting proposals to yield the valid instance masks.
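The mask filtering step can be illustrated with a minimal sketch of greedy mask NMS. This is not the authors' implementation: the pixel-set mask representation, the per-mask quality score, and the 0.7 IoU threshold are all assumptions chosen for illustration.

```python
from typing import FrozenSet, List, Tuple

Pixel = Tuple[int, int]   # (row, col)
Mask = FrozenSet[Pixel]   # foreground pixels; a hypothetical representation


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mask_nms(masks: List[Mask], scores: List[float], iou_thresh: float = 0.7) -> List[int]:
    """Greedy NMS: visit masks by descending score and keep a mask only if its
    IoU with every already-kept mask is at most iou_thresh.
    Returns the indices of kept masks in descending score order."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if not masks[i]:
            continue  # drop empty masks
        if all(mask_iou(masks[i], masks[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


def rect(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Helper for the demo: a filled rectangle [r0, r1) x [c0, c1)."""
    return frozenset((r, c) for r in range(r0, r1) for c in range(c0, c1))


# Masks 0 and 1 overlap heavily (IoU 0.8); mask 2 is disjoint from both.
masks = [rect(0, 0, 4, 4), rect(0, 0, 4, 5), rect(10, 10, 12, 12)]
scores = [0.9, 0.8, 0.95]  # assumed per-mask quality scores
kept = mask_nms(masks, scores)
print(kept)  # [2, 0]
```

In this toy case the lower-scored near-duplicate (index 1) is suppressed, while the two distinct regions survive. A practical version would operate on array masks and vectorized IoU, but the selection logic would be the same.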

2. Dense Semantic Description

This stage is the core of the pixel-to-text paradigm. Using DAM, global context and localized features are extracted and fused to generate a context-aware description for each mask. This fine-grained text is the interpretable basis for the reasoning stage.

3. Logic-Driven Reasoning

Bypassing entangled VLMs, we deploy a plug-and-play LLM. Given the query and descriptions, the LLM explicitly evaluates semantic entailment to output valid mask indices. This enables two modes: Reasoning Segmentation (abstract deduction) and Referring Segmentation (fine-grained grounding).
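One way to picture this stage is as a prompt over numbered mask descriptions followed by strict parsing of the returned indices. The sketch below is an assumption-laden illustration, not the published prompt or parser: `build_prompt`, `parse_indices`, `select_masks`, the prompt wording, and the stub `fake_llm` are all hypothetical.

```python
import json
import re
from typing import Callable, Dict, List


def build_prompt(query: str, descriptions: Dict[int, str]) -> str:
    """List each mask description under its index, then state the query."""
    lines = [f"[{idx}] {text}" for idx, text in sorted(descriptions.items())]
    return (
        "Each line describes one segmented region.\n"
        + "\n".join(lines)
        + f"\n\nQuery: {query}\n"
        "Reply with a JSON list of the region indices that satisfy the query."
    )


def parse_indices(reply: str, valid: Dict[int, str]) -> List[int]:
    """Extract the first JSON list in the reply; keep known, unique int indices."""
    match = re.search(r"\[[^\[\]]*\]", reply)
    if not match:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    seen, out = set(), []
    for item in raw:
        if type(item) is int and item in valid and item not in seen:
            seen.add(item)
            out.append(item)
    return out


def select_masks(query: str, descriptions: Dict[int, str],
                 llm: Callable[[str], str]) -> List[int]:
    """Run the (pluggable) LLM on the prompt and return valid mask indices."""
    return parse_indices(llm(build_prompt(query, descriptions)), descriptions)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call, used only to make the sketch runnable."""
    return "Cutting a cake needs the cake and the knife: [0, 2]"


descriptions = {
    0: "a frosted cake with lit candles",
    1: "a wooden dining table",
    2: "a stainless steel knife with a black handle",
}
selected = select_masks("Divide the dessert into several pieces.", descriptions, fake_llm)
print(selected)  # [0, 2]
```

Because the LLM only sees text and returns indices, any model exposing a string-in, string-out interface could be swapped in for `fake_llm`, which is the sense in which the reasoning component is plug-and-play.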

Dataset Engine Workflow

GEAR-Seg inherently functions as a scalable data engine, automatically generating high-quality reasoning segmentation annotations through a closed-loop pipeline.


Figure 3: Overview of the GEAR-Seg data generation pipeline and operational modes. The engine fuses multi-granularity visual semantics to autonomously generate diverse annotations.

🖼️ Macro-Level Context

Lightweight VLM extracts global scene context and relationships

🔍 Micro-Level Details

SAM 2 + DAM extract fine-grained semantic regions and attributes

🧠 LLM Reasoning Core

Synthesizes modalities to generate queries, logic chains, and masks

📊 High-Quality Output

Produces challenging queries with explicit logic chains
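The four stages above can be sketched as a generate-then-validate loop. Everything in this block is an assumption made for illustration: the JSON schema (`query`, `logic_chain`, `mask_ids`), the retry policy, and the function names are hypothetical, and the stub `llm` stands in for the real reasoning core.

```python
import json
from typing import Callable, Dict, Optional


def build_generation_prompt(scene_context: str, regions: Dict[int, str]) -> str:
    """Fuse macro-level scene context with micro-level region descriptions."""
    region_block = "\n".join(f"[{i}] {d}" for i, d in sorted(regions.items()))
    return (
        f"Scene: {scene_context}\n"
        f"Regions:\n{region_block}\n"
        "Return one JSON object with keys: query (string), "
        "logic_chain (list of reasoning steps), mask_ids (list of region indices)."
    )


def validate_sample(raw: str, regions: Dict[int, str]) -> Optional[dict]:
    """Accept a sample only if it parses and refers to existing regions."""
    try:
        sample = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(sample, dict):
        return None
    query = sample.get("query")
    chain = sample.get("logic_chain")
    ids = sample.get("mask_ids")
    if not isinstance(query, str) or not query.strip():
        return None
    if not isinstance(chain, list) or not chain:
        return None  # an explicit logic chain is required
    if not isinstance(ids, list) or not ids:
        return None
    if not all(type(i) is int and i in regions for i in ids):
        return None  # every target must be a known region index
    return sample


def generate_sample(scene_context: str, regions: Dict[int, str],
                    llm: Callable[[str], str], max_tries: int = 3) -> Optional[dict]:
    """Query the LLM until a sample passes validation or tries run out."""
    prompt = build_generation_prompt(scene_context, regions)
    for _ in range(max_tries):
        sample = validate_sample(llm(prompt), regions)
        if sample is not None:
            return sample
    return None


regions = {0: "a frosted cake with lit candles", 1: "a wooden dining table"}
# Stub replies: the first is rejected (empty chain, unknown mask id), the second passes.
replies = iter([
    '{"query": "What can cut this?", "logic_chain": [], "mask_ids": [9]}',
    '{"query": "What suggests a birthday celebration here?", '
    '"logic_chain": ["lit candles imply a birthday", "the candles sit on the cake"], '
    '"mask_ids": [0]}',
])
sample = generate_sample("a family gathered around a dining table",
                         regions, lambda prompt: next(replies))
print(sample["mask_ids"])  # [0]
```

The validation step is one plausible reading of a closed-loop pipeline: samples whose logic chain is missing or whose masks do not exist are rejected and regenerated rather than written to the dataset.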

GEAR-131K Benchmark

A massive reasoning segmentation dataset comprising over 38k images and 656k diverse QA-mask pairs, introducing a multifaceted taxonomy tailored for complex real-world manipulation-oriented reasoning.

38K+ Diverse Images
656K QA-Mask Pairs
131K Base Pairs
3,014 Distinct Target Entities

Dataset Statistics

Figure 4: Detailed statistics of the GEAR-131K benchmark. (a) Image distribution across source datasets (LVIS, VOC, Mapillary, ADE20K). (b) Proportion of the five specialized reasoning categories. (c) Word cloud illustrating the semantic diversity of targeted entities. (d) Comprehensive feature comparison against existing reasoning segmentation datasets.

Five Specialized Reasoning Categories

Commonsense Reasoning

Evaluates deductive capabilities based on contextual visual cues.

Example: Identifying the birthday cake from "What suggests a birthday celebration here?"

Functional Reasoning

Targets object groups based on utility or affordances, breaking single-category limitations.

Example: Retrieving cars, bicycles, and buses simultaneously given "Identify all means of transportation."

Manipulation-related Reasoning

Designed for complex interaction tasks requiring multi-target coordination across distinct semantic categories.

Example: Returning both the cake and knife for "Divide the dessert into several pieces."

Part-based Reasoning

Breaks the holistic object-level boundary by localizing fine-grained, sub-instance regions.

Example: Isolating "the safe-to-hold part of the knife".

Attribute-based Reasoning

Evaluates fine-grained perception of physical properties (material, shape, state), which is crucial for embodied agents.

Example: Distinguishing a "plastic toy knife" from a "sharp metallic blade".
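A record type for this taxonomy might look like the sketch below. The field names, enum values, and mask indices are hypothetical and are not taken from the released dataset format. The queries reuse the examples above, lightly adapted where the original example is a noun phrase rather than a full query.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ReasoningCategory(Enum):
    COMMONSENSE = "commonsense"
    FUNCTIONAL = "functional"
    MANIPULATION = "manipulation"
    PART = "part"
    ATTRIBUTE = "attribute"


@dataclass(frozen=True)
class QAMaskPair:
    query: str
    category: ReasoningCategory
    mask_ids: Tuple[int, ...]  # one or more target masks (indices are made up here)

    @property
    def is_multi_target(self) -> bool:
        """True when the query resolves to more than one mask."""
        return len(self.mask_ids) > 1


pairs = [
    QAMaskPair("What suggests a birthday celebration here?",
               ReasoningCategory.COMMONSENSE, (0,)),
    QAMaskPair("Identify all means of transportation.",
               ReasoningCategory.FUNCTIONAL, (3, 4, 7)),
    QAMaskPair("Divide the dessert into several pieces.",
               ReasoningCategory.MANIPULATION, (0, 2)),
    QAMaskPair("Segment the safe-to-hold part of the knife.",
               ReasoningCategory.PART, (5,)),
    QAMaskPair("Segment the sharp metallic blade.",
               ReasoningCategory.ATTRIBUTE, (6,)),
]

per_category = Counter(p.category for p in pairs)
multi_target_queries = [p.query for p in pairs if p.is_multi_target]
print(len(per_category), len(multi_target_queries))  # 5 2
```

Storing targets as a tuple of mask indices, rather than a single index, is what lets functional and manipulation queries return several objects at once.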

Dataset Visualizations & Linguistic Diversity

Dataset Visualizations

Figure 5: Top: Representative examples of the 5-fold linguistic expansion in the GEAR-131K dataset. Bottom: Additional dataset visualizations demonstrating the variety of reasoning scenarios covered by the dataset.
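The reported totals are consistent with a five-fold factor: 131K base pairs times 5 gives roughly 655K, which matches the stated 656K once rounding of the base count is allowed for. The sketch below assumes, without confirmation from the source, that each base pair is expanded into five phrasings that share the same target masks. The function name and the paraphrases are hypothetical.

```python
from typing import Dict, List, Sequence


def expand_pair(variants: Sequence[str], mask_ids: Sequence[int],
                fold: int = 5) -> List[Dict]:
    """Turn one base annotation into `fold` QA-mask pairs, one per phrasing.
    All variants point at the same masks; only the language changes."""
    if len(variants) != fold:
        raise ValueError(f"expected {fold} phrasings, got {len(variants)}")
    return [{"query": q, "mask_ids": list(mask_ids)} for q in variants]


# Hypothetical paraphrases of one manipulation query (illustrative only).
variants = [
    "Divide the dessert into several pieces.",
    "Cut the cake into slices.",
    "Which objects are needed to portion the dessert?",
    "Prepare the cake so that everyone gets a piece.",
    "Slice up the dessert for the guests.",
]
pairs = expand_pair(variants, mask_ids=[0, 2])
print(len(pairs))  # 5
```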