Semantic Alignment for Multimodal Large Language Models

1 Zhejiang University     2 National University of Singapore     3 Alibaba Group
*Indicates Equal Contribution

Abstract

Research on Multi-modal Large Language Models (MLLMs) for multi-image cross-modal instructions has received increasing attention and made significant progress, particularly in scenarios involving closely resembling images (e.g., change captioning). Existing MLLMs typically follow a two-step pipeline: first, extracting visual tokens independently for each input image, and then aligning these visual tokens from different images with the Large Language Model (LLM) in its textual feature space. However, extracting visual tokens independently for each image may cause different semantics to be prioritized for different images, so that the linking information among images is not preserved for subsequent LLM analysis. This issue becomes more serious when significant variations exist among the images (e.g., visual storytelling). To address this challenge, we introduce Semantic Alignment for Multi-modal large language models (SAM). By incorporating bidirectional semantic guidance between different images into the visual-token extraction process, SAM enhances the preservation of linking information for coherent analysis and aligns the semantics of different images before they are fed into the LLM. As a test bed, we propose a large-scale dataset named MmLINK consisting of 69K samples. Unlike most existing datasets for MLLM fine-tuning, our MmLINK dataset comprises multi-modal instructions with significantly diverse images. Extensive experiments on the group captioning task and the storytelling task demonstrate the effectiveness of our SAM model, which surpasses state-of-the-art methods by a large margin (+37% CIDEr on group captioning and +22% CIDEr on storytelling).
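For concreteness, the conventional two-step pipeline described above can be sketched as follows. This is a minimal illustrative sketch under our own assumptions, not any particular model's implementation; the `vision_encoder`, `abstractor`, and `llm` modules are placeholders for whatever backbone a given MLLM uses.

```python
import torch
import torch.nn as nn

class TwoStepMLLM(nn.Module):
    """Sketch of the conventional pipeline: visual tokens are extracted for
    each image independently, then concatenated with the text embeddings
    and aligned inside the LLM's feature space."""

    def __init__(self, vision_encoder: nn.Module, abstractor: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a ViT
        self.abstractor = abstractor          # e.g., a Q-Former / resampler
        self.llm = llm

    def forward(self, images, text_embeds):
        # Step 1: per-image extraction -- no information flows between images
        # here, so linking semantics among images can be lost.
        visual_tokens = [self.abstractor(self.vision_encoder(img)) for img in images]
        # Step 2: align all visual tokens with the text inside the LLM.
        return self.llm(torch.cat(visual_tokens + [text_embeds], dim=1))
```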

Introduction Figure

MLLMs can effectively address reasoning tasks by aligning the similarities and pinpointing the differences between highly similar images. However, the effectiveness of MLLMs, including GPT-4V, diminishes when the images differ significantly in content, context, or style. This issue is particularly evident when models need to align character identities or knowledge concepts across images.

Dataset Curation

Dataset Figure

We begin by selecting images featuring a character in different poses (A), along with two other distinct characters (B, C). The selected images are segmented to isolate each character, after which the cutouts are merged into masked images. Inpainting is then used to fill in the background regions of these masked images, guided by descriptions generated by ChatGPT, to obtain the final images. Text annotations are generated by InstructBLIP and further refined with ChatGPT.
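The curation steps above can be laid out as a simple pipeline. The sketch below is only an assumption-laden outline of how one MmLINK sample could be assembled: the callables passed in (`segment`, `merge`, `inpaint`, `caption`, `refine`) are hypothetical stand-ins for the segmentation, inpainting, InstructBLIP, and ChatGPT tools mentioned above, not actual APIs from our codebase.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# "Image" is left abstract here; in practice it would be a PIL.Image or a tensor.
Image = object

@dataclass
class MmLinkSample:
    images: List[Image]   # final inpainted composites
    annotation: str       # refined text annotation

def curate_sample(
    character_images: Sequence[Image],                 # poses of character A plus B and C
    background_prompt: str,                            # scene description from ChatGPT
    segment: Callable[[Image], Image],                 # isolates the character
    merge: Callable[[Sequence[Image]], List[Image]],   # builds masked composites
    inpaint: Callable[[Image, str], Image],            # fills masked background regions
    caption: Callable[[Sequence[Image]], str],         # e.g., InstructBLIP
    refine: Callable[[str], str],                      # e.g., ChatGPT post-editing
) -> MmLinkSample:
    """Hypothetical outline of producing one MmLINK sample."""
    cutouts = [segment(img) for img in character_images]      # 1. segment characters
    masked = merge(cutouts)                                    # 2. merge into masked images
    finals = [inpaint(m, background_prompt) for m in masked]   # 3. inpaint backgrounds
    annotation = refine(caption(finals))                       # 4. caption, then refine
    return MmLinkSample(images=finals, annotation=annotation)
```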

Method

Method Figure

The core of our SAM model is the Bidirectional Semantic Guidance mechanism, which consists of two interactive processes: Assisted Visual Token Extraction (Part A) and Contextual Semantic Generation (Part B). In Part A, the Q-former module leverages the contextual semantics c_i, generated in Part B from the contextual images (i.e., the images in the multi-modal instruction other than the currently perceived one), to guide the extraction of visual tokens from the features of the currently perceived image. In Part B, the W-former module selects the contextual semantics from the visual context of the contextual images. This selection is driven by the attention mechanism in the adaptive adjustment and assisted by the initial visual tokens h_i extracted from the currently perceived image in Part A.
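As a rough illustration of how the two parts could interact, the sketch below runs one image through alternating rounds of guidance using standard cross-attention blocks. It is an assumption-based reading of the figure, not the released SAM code: `q_former` and `w_former` are generic cross-attention modules, and the loop simply alternates between (B) selecting contextual semantics c_i conditioned on the current visual tokens and (A) re-extracting visual tokens h_i conditioned on those semantics.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Generic cross-attention: queries attend to a set of key/value tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(queries, context, context)
        return self.norm(queries + out)

class BidirectionalSemanticGuidance(nn.Module):
    """Sketch (our reading) of the two interactive processes for image i.

    Part A (Assisted Visual Token Extraction): learnable queries attend to the
    features of the currently perceived image, assisted by contextual semantics c_i.
    Part B (Contextual Semantic Generation): the visual tokens h_i attend to the
    features of the *other* images to select c_i.
    """
    def __init__(self, dim: int = 768, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.q_former = CrossAttentionBlock(dim)   # Part A
        self.w_former = CrossAttentionBlock(dim)   # Part B

    def forward(self, current_feats: torch.Tensor, context_feats: torch.Tensor,
                rounds: int = 2) -> torch.Tensor:
        b = current_feats.size(0)
        h_i = self.queries.expand(b, -1, -1)            # initial visual tokens
        for _ in range(rounds):
            c_i = self.w_former(h_i, context_feats)     # Part B: contextual semantics
            guided = torch.cat([h_i, c_i], dim=1)       # guidance for extraction
            # Part A: keep only the query positions as the refined visual tokens h_i.
            h_i = self.q_former(guided, current_feats)[:, : h_i.size(1)]
        return h_i

# Toy usage with random features (batch=1, 196 patch tokens, dim=768).
if __name__ == "__main__":
    model = BidirectionalSemanticGuidance()
    cur = torch.randn(1, 196, 768)      # features of the currently perceived image
    ctx = torch.randn(1, 2 * 196, 768)  # concatenated features of contextual images
    print(model(cur, ctx).shape)        # -> torch.Size([1, 32, 768])
```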

Qualitative Result

Case Figure

SAM successfully performs semantic alignment and produces accurate responses.