Publication
DeiSAM: Segment Anything with Deictic Prompting
Hikaru Shindo; Manuel Brack; Gopika Sudhakaran; Devendra Singh Dhami; Patrick Schramowski; Kristian Kersting
In: Computing Research Repository (CoRR), Vol. abs/2402.14123, Pages 1-23, arXiv, 2024.
Abstract
Large-scale, pre-trained neural networks have demonstrated strong capabilities in various tasks, including zero-shot image segmentation. To identify concrete objects in complex scenes, humans instinctively rely on deictic descriptions in natural language, i.e., referring to something depending on the context, such as "The object that is on the desk and behind the cup". However, deep learning approaches cannot reliably interpret such deictic representations as they have limited reasoning capabilities, particularly in complex scenarios. Therefore, we propose DeiSAM, a combination of large pre-trained neural networks with differentiable logic reasoners, for deictic promptable segmentation. Given a complex, textual segmentation description, DeiSAM leverages Large Language Models (LLMs) to generate first-order logic rules and performs differentiable forward reasoning on generated scene graphs. Subsequently, DeiSAM segments objects by matching them to the logically inferred image regions. As part of our evaluation, we propose the Deictic Visual Genome (DeiVG) dataset, containing paired visual input and complex, deictic textual prompts. Our empirical results demonstrate that DeiSAM is a substantial improvement over purely data-driven baselines for deictic promptable segmentation.
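To make the described pipeline concrete, the following minimal Python sketch illustrates the idea of matching an LLM-derived first-order rule against a scene graph. All names and data structures here are hypothetical illustrations, not the authors' implementation: the scene graph is a plain list of (subject, relation, object) triples, the rule conditions are hard-coded rather than generated by an LLM, and the reasoning step is crisp triple matching instead of the paper's differentiable forward reasoner.

# Hypothetical sketch of the DeiSAM idea: match a first-order rule
# (as an LLM might derive it from a deictic prompt) against a scene graph,
# then hand the inferred entity to a segmenter. Not the authors' code.

SceneGraph = list[tuple[str, str, str]]  # (subject, relation, object) triples

def forward_reason(graph: SceneGraph, conditions: list[tuple[str, str]]) -> set[str]:
    """Return entities satisfying every (relation, object) condition;
    a crude, non-differentiable stand-in for forward reasoning."""
    candidates = None
    for relation, obj in conditions:
        matched = {s for (s, r, o) in graph if r == relation and o == obj}
        candidates = matched if candidates is None else candidates & matched
    return candidates or set()

if __name__ == "__main__":
    # Toy scene graph for the prompt
    # "The object that is on the desk and behind the cup".
    graph: SceneGraph = [
        ("laptop", "on", "desk"),
        ("laptop", "behind", "cup"),
        ("cup", "on", "desk"),
        ("book", "on", "shelf"),
    ]
    # Conditions corresponding to a rule such as
    # target(X) :- on(X, desk), behind(X, cup).
    conditions = [("on", "desk"), ("behind", "cup")]
    print(forward_reason(graph, conditions))  # {'laptop'}: region passed to a segmenter

In DeiSAM itself, this matching is differentiable and the matched entity's image region is passed to a segmentation model; the sketch only shows the logical selection step on a symbolic scene graph.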
