Zero-Shot 3D Shape Correspondence

Abstract

We propose a novel zero-shot approach to computing correspondences between 3D shapes. Existing approaches mainly focus on isometric and near-isometric shape pairs (e.g., human vs. human), while strongly non-isometric and inter-class shape matching (e.g., human vs. cow) has received less attention. To this end, we introduce a fully automatic method that exploits the reasoning capabilities of recent foundation models in language and vision to tackle difficult shape correspondence problems. Our approach comprises multiple stages. First, we classify the 3D shapes in a zero-shot manner by feeding rendered shape views to a language-vision model (e.g., BLIP2) to generate a list of class proposals per shape. These proposals are unified into a single class per shape by employing the reasoning capabilities of ChatGPT. Second, we attempt to segment the two shapes in a zero-shot manner, but in contrast to the co-segmentation problem, we do not require a mutual set of semantic regions. Instead, we propose to exploit the in-context learning capabilities of ChatGPT to generate a separate set of semantic regions for each shape and a semantic mapping between the two sets. This enables our approach to match strongly non-isometric shapes with significant differences in geometric structure. Finally, we employ the generated semantic mapping to produce coarse correspondences that can be further refined by the functional maps framework to produce dense point-to-point maps. Our approach, despite its simplicity, produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes.

Method

Our proposed approach has three main components: (1) Zero-shot 3D shape classification: we feed k rendered views of each shape to a BLIP2 [Li et al. 2023] model to generate a list of class proposals, which ChatGPT then unifies into a single class per shape. (2) Semantic region/mapping generation: we use the in-context learning capabilities of ChatGPT to produce a set of semantic regions for each shape and a semantic mapping between the two sets. (3) Zero-shot 3D semantic segmentation: our proposed SAM-3D segments each shape into its semantic regions, and the mapping is then used to produce a sparse correspondence map that can be further densified using the functional maps framework [Ovsjanikov et al. 2012].
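To make the last step more concrete, the following is a minimal sketch of how a semantic mapping could be turned into sparse correspondences once both shapes carry per-vertex region labels. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vertex data, and the choice of one landmark per region (the vertex nearest the region centroid) are all hypothetical. The BLIP2, ChatGPT, and SAM-3D stages are not modelled here; their outputs (labels and mapping) are simply given as inputs.

```python
from collections import defaultdict
from math import dist


def region_landmarks(vertices, labels):
    """Return {region name: index of the vertex closest to that region's centroid}.

    Assumption for illustration only: one landmark per region, chosen by centroid.
    """
    members = defaultdict(list)
    for idx, name in enumerate(labels):
        members[name].append(idx)
    landmarks = {}
    for name, idxs in members.items():
        centroid = tuple(
            sum(vertices[i][d] for i in idxs) / len(idxs) for d in range(3)
        )
        landmarks[name] = min(idxs, key=lambda i: dist(vertices[i], centroid))
    return landmarks


def coarse_correspondences(verts_a, labels_a, verts_b, labels_b, mapping):
    """Pair one landmark per mapped region; skip regions absent from either shape."""
    lm_a = region_landmarks(verts_a, labels_a)
    lm_b = region_landmarks(verts_b, labels_b)
    pairs = []
    for region_a, region_b in mapping.items():
        if region_a in lm_a and region_b in lm_b:
            pairs.append((lm_a[region_a], lm_b[region_b]))
    return pairs


# Toy data (hypothetical): a "human" with head and arm, a "cow" with head,
# front leg, and a tail that has no counterpart on the human.
human_verts = [(0, 0, 2.0), (0, 0, 2.2), (0, 0, 1.8),
               (1.0, 0, 1), (1.1, 0, 1), (1.2, 0, 1)]
human_labels = ["head", "head", "head", "arm", "arm", "arm"]

cow_verts = [(2.0, 0, 1), (2.1, 0, 1), (2.2, 0, 1),
             (1, 0, 0.0), (1, 0, -0.2), (1, 0, -0.4),
             (-1, 0, 0.5)]
cow_labels = ["head", "head", "head",
              "front leg", "front leg", "front leg", "tail"]

# The two region sets differ; the mapping links only the regions that correspond.
semantic_mapping = {"head": "head", "arm": "front leg"}

pairs = coarse_correspondences(
    human_verts, human_labels, cow_verts, cow_labels, semantic_mapping
)
print(pairs)  # [(0, 1), (4, 4)]
```

Each pair is (vertex index on the first shape, vertex index on the second shape). The cow's tail is left unmatched because the mapping has no entry for it, which mirrors the point that the two shapes need not share a common set of regions. In a full pipeline, sparse pairs like these would serve as the starting point for functional-map refinement into a dense map.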

Citation

                    
    @inproceedings{abdelreheem2023zeroshot,
        title     = {Zero-Shot 3D Shape Correspondence},
        author    = {Ahmed Abdelreheem and Abdelrahman Eldesokey and Maks Ovsjanikov and Peter Wonka},
        booktitle = {SIGGRAPH Asia},
        year      = {2023},
    }