Multimodal Fusion

Multimodal fusion is how a model combines information from different input types (modalities) into a single representation it can reason over. In a vision-language model (VLM), image and text representations must be merged at some point in the architecture. The fusion strategy directly affects what the model can learn: poor fusion produces a model that looks at the image and reads the text independently without connecting them.

Three main fusion strategies exist. Early fusion concatenates raw inputs before processing: image patches and text tokens enter the same transformer from layer one (used by LLaVA, PaliGemma). Late fusion processes each modality through separate encoders, then combines their final representations (used by CLIP for retrieval). Cross-attention fusion processes modalities separately but inserts cross-attention layers where one modality can attend to the other (used by Flamingo, BLIP-2). Each has tradeoffs: early fusion captures fine-grained interactions but requires more compute; late fusion is efficient but misses cross-modal details; cross-attention is a middle ground. Some architectures combine approaches: BLIP-2's Q-Former uses cross-attention to create a compact visual summary that is then fed to the LLM as an early-fused input.
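The three strategies can be sketched in a few lines of NumPy. This is an illustrative toy, not any particular model's implementation: the dimensions, mean pooling, and single-head attention are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # shared embedding dimension (assumption)
img = rng.normal(size=(16, d))         # 16 image-patch embeddings
txt = rng.normal(size=(6, d))          # 6 text-token embeddings

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Early fusion: concatenate both modalities into one sequence that a
# transformer would process jointly from the first layer on.
early = np.concatenate([img, txt], axis=0)   # shape (22, d)

# Late fusion: pool each modality separately, then combine only the final
# embeddings (here with cosine similarity, as in CLIP-style retrieval).
img_emb = img.mean(axis=0)
txt_emb = txt.mean(axis=0)
late_score = img_emb @ txt_emb / (
    np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)
)

# Cross-attention fusion: text tokens (queries) attend to image patches
# (keys/values), injecting visual features into the text stream while the
# two modalities keep separate backbones.
attn = softmax(txt @ img.T / np.sqrt(d))     # (6, 16) attention weights
fused_txt = attn @ img                       # (6, d) image-informed text tokens
```

Note how the three differ in where the modalities meet: early fusion merges sequences before any processing, late fusion merges only two pooled vectors, and cross-attention merges inside the network via attention weights.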

Fusion strategy choices impact practical model behavior. Early fusion VLMs tend to be better at spatial reasoning and counting because text generation has direct access to patch-level image features. Late fusion models are better for retrieval tasks where you need separate image and text embeddings. Understanding fusion tradeoffs helps teams pick the right VLM architecture for their use case.
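The retrieval advantage of late fusion comes from precomputation: image embeddings can be indexed offline, so serving a text query costs one encoder pass plus a matrix product. A minimal sketch, assuming unit-normalized embeddings and a random stand-in gallery:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Offline: embed and normalize the whole image gallery once
# (random vectors stand in for real image embeddings here).
gallery = rng.normal(size=(1000, d))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# Online: embed one text query, then rank the entire gallery
# with a single matrix-vector product of cosine similarities.
query = rng.normal(size=d)
query /= np.linalg.norm(query)

scores = gallery @ query
top5 = np.argsort(-scores)[:5]   # indices of the 5 best-matching images
```

An early-fusion VLM cannot do this: scoring each image-text pair requires a full joint forward pass, so ranking 1,000 images means 1,000 model evaluations instead of one dot-product sweep.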
