Won the NeurIPS 2023 HomeRobot Open Vocabulary Mobile Manipulation (OVMM) Challenge


The HomeRobot Open Vocabulary Mobile Manipulation (OVMM) Challenge involved designing an embodied AI agent capable of navigating unfamiliar environments, recognizing open-vocabulary object classes, and manipulating novel objects.

Challenge website

Our solution

I was on the team that worked on the perception pipeline for this work. The baseline agent provided in the challenge used DETIC for open vocabulary segmentation. As alternatives, we tried GroundingDINO combined with SAM, FastSAM, and EfficientSAM. We observed that DETIC performed much better than GroundingDINO in this environment, especially at identifying partially visible objects. To reduce false positives, we tried using an object tagging model, the Recognize Anything Model (RAM), to identify all the objects in the scene, and passed its predicted tags as text prompts to DETIC and GroundingDINO. We also experimented with explicitly giving the task-specific objects (start receptacle name, end receptacle name, object) to DETIC and GroundingDINO, along with the tags that RAM predicted, to prime the open vocabulary segmentation module.
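The prompt-building step described above can be sketched roughly as follows. This is only an illustration, not our actual challenge code: the function names `build_vocabulary` and `to_dot_prompt` are hypothetical, the example tags are made up, and the period-separated caption is an assumption based on the prompt style commonly used with GroundingDINO. The list form is what a custom-vocabulary detector such as DETIC would be given.

```python
def build_vocabulary(task_objects, ram_tags):
    """Merge task-specific names with RAM tags (hypothetical helper).

    Task-specific names come first so they are never dropped, and
    duplicates are removed after lowercasing and trimming whitespace.
    """
    seen = set()
    vocabulary = []
    for name in list(task_objects) + list(ram_tags):
        key = name.strip().lower()
        if key and key not in seen:
            seen.add(key)
            vocabulary.append(key)
    return vocabulary


def to_dot_prompt(vocabulary):
    """Join class names into one period-separated caption.

    Assumption: the detector accepts a single text prompt in which
    class names are separated by " . ".
    """
    if not vocabulary:
        return ""
    return " . ".join(vocabulary) + " ."


# Illustrative values only: start receptacle, end receptacle, object.
task_objects = ["table", "couch", "cup"]
# Illustrative tags that a tagging model might return for one frame.
ram_tags = ["Cup", "lamp", "table ", "rug"]

vocabulary = build_vocabulary(task_objects, ram_tags)
prompt = to_dot_prompt(vocabulary)
```

Putting the task-specific names first means the start receptacle, end receptacle, and target object are always in the prompt, even when the tagger misses them in a given frame.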