Multi-Modal Learning: Bridging Vision and Language

December 20, 2023
15 min read
Vision · NLP · Multi-Modal · Transformers

The Problem: AI systems have traditionally processed vision and language in isolation. A vision-only model can identify 'a person running' but cannot explain why they are running; a language-only model can discuss 'red cars' but has no grounding in what 'red' actually looks like. This disconnect limits real-world understanding.

The Idea: The transformer architecture's attention mechanism naturally handles sequences of any kind — whether image patches or word tokens. This opens the door to unified models that process both modalities in a shared semantic space, enabling cross-modal reasoning.
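To make the shared-space idea concrete, here is a minimal sketch (all dimensions, projection matrices, and inputs are random stand-ins, not a real model): patch features and word embeddings are projected into one d-dimensional space, concatenated, and passed through a single attention head that treats every position identically, regardless of which modality it came from.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # shared embedding dimension (arbitrary for illustration)

# Stand-ins for modality-specific features: 4 image patches, 6 word tokens.
patches = rng.normal(size=(4, 32))   # e.g. flattened 32-dim patch features
tokens  = rng.normal(size=(6, 48))   # e.g. 48-dim word embeddings

# Modality-specific linear projections into the shared space.
W_img = rng.normal(size=(32, d)) / np.sqrt(32)
W_txt = rng.normal(size=(48, d)) / np.sqrt(48)
x = np.concatenate([patches @ W_img, tokens @ W_txt])  # (10, d) joint sequence

# One self-attention head over the joint sequence: the mechanism never
# asks whether a position came from pixels or from words.
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v  # each patch attends to every word, and vice versa

print(out.shape)  # (10, 16)
```

In a trained model the projections are learned so that, e.g., the patch of a red car and the token 'red' land near each other, which is what makes the cross-modal attention meaningful.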

My Solution: Static Key Attention — a more efficient attention variant for vision transformers that pre-computes certain attention patterns, reducing computational cost while maintaining performance. This builds on the evolution from separate encoders (CLIP) to deeply fused models (BLIP, Flamingo) to fully unified transformers.
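The post does not spell out the mechanism here, so the following is only one plausible reading of "static keys", a hedged sketch rather than the author's actual method: the keys are learned per-position parameters fixed after training, so the input-dependent key projection is skipped at inference while queries and values are computed as usual.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 16  # sequence length and model dimension (illustrative)

x = rng.normal(size=(n, d))               # input sequence
W_q = rng.normal(size=(d, d)) / np.sqrt(d)
W_v = rng.normal(size=(d, d)) / np.sqrt(d)

# Hypothetical static keys: one learned key per position, frozen after
# training, so the key projection x @ W_k is never computed at inference.
K_static = rng.normal(size=(n, d))

q = x @ W_q
v = x @ W_v
scores = q @ K_static.T / np.sqrt(d)      # (n, n) scores against fixed keys
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ v                               # standard weighted sum of values
```

Under this reading the saving is one of the three input projections per layer; the actual variant described in the paper may pre-compute more (or different) parts of the attention pattern.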

The Vision: AI systems with human-like multimodal understanding — systems that don't just process images and text separately but truly comprehend the world through multiple complementary channels. This leads to embodied AI, world models, and multimodal scientific reasoning.


Zizhao Hu

PhD Student at USC · AI Researcher