Open-Set Object Retrieval in Clutter via Hybrid Vision-Language and Geometric Reasoning

Abstract

Robots must effectively retrieve novel objects in clutter, where a target may not be directly visible or accessible. In many real-world setups, a robot must handle an open set of objects under constrained viewpoints and limited reachability, conditions that are often simplified in existing laboratory setups. Under these constraints, the robot must generate safe retrieval motions despite kinematic limitations and uncertainty about the novel objects' shapes. This work proposes a complete modular pipeline for such targeted object retrieval problems in clutter. The focus is on evaluating solutions for a critical component of this pipeline: reasoning about the order in which blocking objects must be removed before the target can be retrieved. Vision-Language Models (VLMs) have recently been argued to be pretrained solutions that can reason about such spatial object relationships given RGB images. Traditional engineered solutions in this space heuristically identify object relationships from depth. This work shows that pretrained VLMs alone (without finetuning) cannot yet outperform even random object selection. Engineered solutions are superior but still exhibit many failure cases. This work proposes hybrid strategies for targeted object retrieval that combine the visual reasoning of VLMs with engineered dependency reasoning in 3D, which improve performance. These observations are confirmed in extensive physics-based simulation experiments and real-world experiments on a setup involving a robotic arm with a parallel gripper and a torso-mounted stereo camera.

Problem

Targeted Object Retrieval in Clutter (TORC) arises in manufacturing, logistics and service robotics. In our paper, we address a specific version as follows:

Consider a scene of objects for which 3D models are not provided. The objects initially rest stably on a support surface, on top of other objects, or both. Objects may occlude each other from the camera's view, which comes from an RGB-D sensor attached to the robot's torso for ego-centric perception. No multi-view observation is available to perform full scene reconstruction due to the static robot base. The robot should retrieve the target object as fast as possible while minimizing the number of picks and accidental object drops.

Pipeline

We set up the following open-loop pipeline to address the TORC problem. First, the camera's RGB and depth images are fed into GSAM2 and TSDF fusion to generate a segmentation image and a 3D occlusion volume, respectively. The camera point cloud is used by Contact GraspNet to generate grasps for the objects in the scene. Grasps that are in collision with other objects or that have no collision-free IK solutions are filtered out. Then, a task-planning method decides which object to retrieve for this "pick." cuRobo then computes a motion plan to retrieve the chosen object using its highest-scoring grasp. Finally, the motion plan is executed on the robot, and the robot moves on to the next pick. In each experiment, the robot is given at most 15 picks to retrieve the target object.
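The control flow of this open-loop pipeline can be sketched as follows. This is a minimal illustration, not the actual implementation: all component functions are hypothetical stand-ins for the real modules (GSAM2 + TSDF fusion, Contact GraspNet, the task planner, cuRobo), injected as callables.

```python
# Hypothetical sketch of the open-loop TORC pipeline described above.
# Every callable argument is a stand-in for a real module, not its API.

MAX_PICKS = 15  # per-experiment pick budget, as stated in the text

def run_torc_episode(scene, target_id, perceive, plan_grasps,
                     select_object, plan_motion, execute):
    """One open-loop episode: perceive, select, pick, repeat."""
    for pick in range(MAX_PICKS):
        obs = perceive(scene)              # segmentation + occlusion volume
        grasps = plan_grasps(obs)          # {object_id: [scored grasps]}
        # keep only objects with a collision-free, IK-feasible grasp
        graspable = {oid: g for oid, g in grasps.items() if g}
        if not graspable:
            return False, pick             # dead end: nothing is pickable
        chosen = select_object(obs, graspable, target_id)  # task planner
        traj = plan_motion(graspable[chosen][0])  # highest-scoring grasp
        execute(scene, chosen, traj)
        if chosen == target_id:
            return True, pick + 1          # target retrieved
    return False, MAX_PICKS                # pick budget exhausted
```

The task planner (`select_object` here) is the component the rest of this page compares across methods; everything else in the loop is fixed.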

Scenes

Simulated

We created 40 unique experiments from 22 simulated scenes, with multiple objects from each scene serving as targets (one target per experiment). Physics simulation and perception were performed using MuJoCo. These scenes were obtained by programmatically generating a large set of scenes that were then filtered down by scripts and human evaluation, as documented below; as a result, the final set of scenes has non-consecutive IDs. The following is an interactive visualization of our simulated scenes. The target object is marked by the "0" segmentation in red. You can use the slider below to scrub through the scenes.

Front View (Camera)

Back View

Dataset Generation

An important aspect of the evaluation involved generating scenes that are both challenging, i.e., they require multiple object removals before the target becomes reachable, and solvable. Randomly placing objects frequently leads to scenes without valid grasps. These failure modes may arise from poor grasp generation or from complex object-object interactions during retraction, which cannot be trivially resolved even by leveraging privileged perception in simulation. Therefore, to focus the evaluation on task planning rather than other parts of the pipeline, generating the scene dataset involved manual labeling and tuning, as well as filtering using hand-specified rules. This resulted in a dataset of 40 TORC experiments, each with a set of objects placed on a tabletop and a target object to be retrieved.

To initialize the process, scenes were taken from GraspClutter6d. Each scene from GraspClutter6d consisted of a cluttered arrangement of objects as well as ground-truth grasps for each object. To produce the retrieval-focused scene dataset, the following steps were executed:

  1. For each scene, an automated script placed objects at the middle of a tabletop in MuJoCo.
  2. A human annotator manually labeled candidate targets in each scene, with preference for occluded objects.
  3. Candidate targets that were irretrievable were then pruned: For each target, all objects (1) directly in front of it, (2) within 15 cm to the left or right, and (3) taller than the target were recorded. If any of these objects possessed no ground-truth grasps that were collision-free (with the static scene) and IK-feasible, the target was pruned.
  4. Scenes were iteratively refined based on feedback from human experts who performed the task planning themselves. If the expert selected the target immediately without removing any other objects, the TORC problem was discarded as trivial. Conversely, if the expert was unable to retrieve the target even after attempting to remove the majority of objects, the scene was adjusted (e.g., by modifying object poses or adding/removing objects) until it admitted a solution with a task sequence of at least two steps.
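The pruning rule in step 3 can be sketched as a simple geometric filter. This is a hedged illustration under assumed conventions that are not stated in the text: the camera looks along +y (so "in front of" means smaller y), x is the left/right axis, and each object record is a hypothetical dictionary with a planar position, a height, and a precomputed feasibility flag.

```python
# Minimal sketch of the step-3 target-pruning rule, assuming the camera
# looks along +y and x is the lateral axis (illustrative conventions).

SIDE_MARGIN = 0.15  # the 15 cm left/right window from the text

def blocking_objects(target, objects):
    """Objects satisfying all three blocking conditions from step 3."""
    tx, ty = target["pos"]
    return [
        o for o in objects
        if o is not target
        and o["pos"][1] < ty                      # (1) directly in front
        and abs(o["pos"][0] - tx) <= SIDE_MARGIN  # (2) within 15 cm laterally
        and o["height"] > target["height"]        # (3) taller than the target
    ]

def target_is_retrievable(target, objects):
    """Prune the target if any blocker lacks a collision-free,
    IK-feasible ground-truth grasp."""
    return all(o["has_feasible_grasp"] for o in blocking_objects(target, objects))
```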

Real

We created three unique experiments using three real-world scenes to test our methods in a realistic setting.

Real Scene 1

Real Scene 2

Real Scene 3

Methods

DG-SELECT

An engineered selection method (DG-SELECT) is adapted from previous work that uses dependency graphs to express object relationships. In that work, the dependencies could be computed explicitly given ground-truth knowledge of the objects' models and poses. Here, the dependencies are computed from perceptual information without access to object models.
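Once such a graph exists, object selection amounts to walking the dependencies from the target until an unblocked object is found. The sketch below assumes a hypothetical graph representation, object -> set of objects blocking it; the paper's actual graph construction from perception is not shown here.

```python
# Minimal sketch of selection from a dependency graph, assuming the
# (hypothetical) representation deps[obj] = {objects blocking obj}.

def next_pick(deps, target):
    """Return an unblocked object on a dependency chain from the target,
    or None if every chain is cyclic (no valid pick exists)."""
    frontier, seen = [target], {target}
    while frontier:
        obj = frontier.pop()
        blockers = deps.get(obj, set())
        if not blockers:
            return obj          # obj itself is directly pickable
        for b in blockers:
            if b not in seen:
                seen.add(b)
                frontier.append(b)
    return None
```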

VLM-SELECT

For the VLM-SELECT method, a VLM is prompted with a labeled image of the workspace and asked to produce a sequence of actions to retrieve the target object. The VLM used in this work is Google DeepMind's Gemini Robotics-ER 1.5, since it is pretrained for robotics tasks and is publicly available. The result is parsed and the first object in the sequence is chosen. VLM-SELECT has no knowledge of which objects have valid grasps; due to the stochastic output of the VLM, it is called again if it fails to select a graspable object.
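The retry logic around the VLM call can be sketched as below. Here `call_vlm` is a hypothetical stand-in for the actual Gemini Robotics-ER request, assumed to return a parsed, ordered list of object labels; it is not a real API.

```python
# Hedged sketch of VLM-SELECT's retry loop; `call_vlm` is a stand-in
# for the remote Gemini call, assumed to return ["obj_3", "obj_1", ...].

def vlm_select(call_vlm, prompt, graspable_ids, max_retries=3):
    """Query the VLM for a removal sequence; re-query if the first
    object in the returned sequence has no valid grasp."""
    for _ in range(max_retries):
        sequence = call_vlm(prompt)
        if sequence and sequence[0] in graspable_ids:
            return sequence[0]
    return None  # report failure for this pick
```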

The VLM prompt for this method is shown below:


        

VLM-FIXES-DG

A hybrid approach, referred to here as VLM-FIXES-DG, uses both a VLM and dependency planning. In this approach, the VLM is not prompted to directly return the object to be picked. Instead, it is prompted with candidate dependency relations between pairs of objects generated by the same reasoning employed by the DG-SELECT approach. The approach relies on the 'visual understanding' of the VLM to fix incorrect dependencies and ultimately construct a more accurate dependency graph. In this way, it is less dependent on tuning the parameters of the heuristics described in the previous section for DG-SELECT. Specifically, in VLM-FIXES-DG, the disambiguation between 'below' and 'behind' relationships is performed by prompting the VLM. The general prompt outline is similar to that of VLM-SELECT, but the output format is now a JSON array of dependencies, and the main part of the task prompt is specified as follows, with <candidates> replaced at runtime by a JSON array of 'behind' dependencies.
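Incorporating the VLM's answer can be sketched as filtering the candidate dependencies down to those the VLM confirmed. The JSON field names below (`object`, `blocked_by`) are illustrative assumptions, not the paper's actual schema.

```python
import json

# Minimal sketch of applying the VLM's verdict to the candidate 'behind'
# dependencies; the JSON schema shown is an assumption for illustration.

def apply_vlm_fixes(candidate_deps, vlm_json):
    """candidate_deps: list of (blocked, blocker) pairs proposed by the
    engineered reasoning; vlm_json: the VLM's JSON array of confirmed
    dependencies. Returns only the pairs the VLM kept."""
    confirmed = {(d["object"], d["blocked_by"]) for d in json.loads(vlm_json)}
    return [dep for dep in candidate_deps if dep in confirmed]
```

The retained pairs then form the edges of the dependency graph used for selection, exactly as in DG-SELECT.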

The VLM prompt for this method is shown below:


        

VLM+GRASPS

The VLM+GRASPS selection method allows a VLM to make the final choice of which object to pick next, but prompts the VLM with grasp dependency information. If the VLM selects an object that is not directly pickable, the query is attempted again.

The overall prompt outline is nearly identical to VLM-SELECT, but a scene description is provided before the task is specified. The scene is described as a list of grasp dependencies written in natural-language sentences of the form "Object A is blocked by object B," or of the form "Object C is graspable" if object C has valid grasps available. As explained in the subsection on grasping above, valid grasps are those with IK solutions that are not in collision with other objects or the static environment, for which the gripper is not in collision with the object to be grasped, and for which the object to be grasped is the only one between the gripper fingers at the grasp pose.
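Rendering the grasp dependencies into that natural-language scene description is straightforward; a minimal sketch, using the two sentence templates quoted above and a hypothetical pair-list representation of the dependencies:

```python
# Minimal sketch of the natural-language scene description fed to the
# VLM in VLM+GRASPS, using the sentence templates from the text.

def describe_scene(deps, graspable):
    """deps: list of (blocked, blocker) object-label pairs;
    graspable: iterable of labels with at least one valid grasp."""
    lines = [f"Object {a} is blocked by object {b}." for a, b in deps]
    lines += [f"Object {c} is graspable." for c in sorted(graspable)]
    return "\n".join(lines)
```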

The VLM prompt for this method is shown below:


        

Experiments

Simulated

We evaluated our methods across 40 simulated scenes with eight repeated runs per scene. Since our data collection does not record videos of runs (to save storage), we recorded a few additional simulated demonstrations to illustrate what a run looks like. These demonstrations are shown below at x8 speed and only depict grasping motions.

Real

We evaluated each method across three real scenes with three repeated runs per scene. A sample of real experiments is shown below. These videos are shown at x8 speed and only depict grasping motions.

Results

Simulated

In addition to the four object selection methods, a random method was included as a baseline. Five different human experts were also asked to perform object selection on each of the 40 problems, for a total of 200 human trials. The corresponding approach is marked as HUMAN. Each selection strategy (RANDOM, VLM-SELECT, DG-SELECT, VLM+GRASPS, VLM-FIXES-DG) was evaluated 10 times on each problem for a total of 400 experiments per approach. In simulated experiments, ground-truth object instance segmentation from MuJoCo is used to isolate the evaluation of object sequencing.

The results show that using a VLM out of the box (despite being pretrained for robotics) underperforms even random selection among directly graspable objects. The dependency graph approach, DG-SELECT, is more successful than RANDOM and VLM-SELECT but critically depends on the parameters of its heuristics, which were tuned to optimize performance in these experiments. The VLM-FIXES-DG approach, which uses the VLM to automatically define the dependency graph, achieved a similar success rate to DG-SELECT with some increase in the number of objects picked until success. Feeding the grasp information into the VLM and letting the VLM make the object selection, as in the VLM+GRASPS approach, achieves the highest success rate while requiring a similar number of picks as DG-SELECT. The VLM-based solutions required increased computation time due to the call to the remote Gemini service. All the automated task planners exhibit similar rates of non-target objects rolling off the tabletop during retrieval, which in some cases arises due to simulation artifacts.

Real

The real-world experiments were performed with a single arm of a Yaskawa Motoman SDA10F robot using a Robotiq 85 gripper and a Zed Mini camera. Grounding SAM 2 is used for object segmentation.

On the real robot, the task planning approaches were evaluated on 3 different scenes, with each method repeated 3 times per scene, for a total of 9 experiments per method and 45 total experiments. These scenes were constructed by placing household objects on a large grid on a tabletop. Their positions were recorded so that each scene could be replicated across the methods. The scenes were designed to require at least 2 picks to retrieve the target. Results are reported similarly to the simulated scenes.

The real-world experiments yield observations similar to the simulated ones. While the real system consumes noisier perception data, which can lead to incorrect target detection, success rates were quite high. A potential advantage of the VLM-based solutions is that they experience a smaller sim-to-real gap given real image input. Still, the VLM-SELECT approach underperforms the alternatives, exhibiting the lowest success rate. The methods that make the final selection via a dependency graph, i.e., DG-SELECT and VLM-FIXES-DG, have high success rates but still occasionally dropped objects. The VLM+GRASPS hybrid solution was the best-performing method, solving all real scenes without any object drops, similar to the human experts.

BibTeX

@inproceedings{anonymous2026opensetclutter,
  author    = {Anonymous},
  title     = {Open-Set Object Retrieval in Clutter via Hybrid Vision-Language and Geometric Reasoning},
  booktitle = {In Submission},
  year      = {2026},
}