Architecture
Building the architectural foundations for scalable, memory-efficient AI systems. My work focuses on transformer memory mechanisms for long-range reasoning, efficient architectures that maximize capability per FLOP, multimodal designs for unified perception-language-action, and scalable architectures that grow from research to production.
Key Research Topics
Transformer Memory Mechanisms
How transformers store, retrieve, and reason over information. Research on KV-cache architectures, recurrent memory layers, state-space models (Mamba/S4), memory-augmented attention, and hybrid designs that give transformers explicit long-term memory without quadratic cost.
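To make the KV-cache idea concrete, here is a minimal sketch of autoregressive decoding with an append-only key/value cache: each step stores one new (key, value) pair and attends over everything cached so far, so past tokens never need re-encoding. This is an illustrative toy (the `KVCache` class and plain-Python vectors are assumptions for exposition, not an implementation from the work described here).

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention over cached key/value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += w * x
    return out

class KVCache:
    """Append-only cache: each decoding step adds one (key, value) pair,
    so attention cost grows with sequence length but encoding cost does not."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, key, value, query):
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 2.0], [1.0, 0.0])  # one cached entry: weight is 1
out2 = cache.step([0.0, 1.0], [3.0, 4.0], [1.0, 0.0])  # now attends over both entries
```

The memory-research question in the blurb above is exactly what this toy exposes: the cache grows linearly with context, and attention over it is quadratic in total, which is what recurrent memory layers and state-space models aim to avoid.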
Efficient Architecture
Reducing compute and memory costs without sacrificing capability. Static key attention, sparse attention patterns, linear attention variants, weight sharing, knowledge distillation, and quantization-aware architecture design for deployment on constrained hardware.
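Linear attention variants, one of the techniques listed above, replace the softmax score matrix with a kernel feature map so that causal attention can be computed from fixed-size running sums. The sketch below uses the common ELU(x)+1 feature map; it is a minimal pure-Python illustration of the general trick, not code from any specific paper or system named on this page.

```python
import math

def elu_plus_one(x):
    """Positive feature map phi(x) = ELU(x) + 1, a common linear-attention choice."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

class LinearAttention:
    """Causal linear attention: instead of the full token-by-token score matrix,
    keep running sums S = sum_t phi(k_t) v_t^T and z = sum_t phi(k_t).
    Each step then costs O(d^2), independent of sequence length."""
    def __init__(self, dim):
        self.S = [[0.0] * dim for _ in range(dim)]
        self.z = [0.0] * dim

    def step(self, q, k, v):
        fq, fk = elu_plus_one(q), elu_plus_one(k)
        for i in range(len(fk)):
            self.z[i] += fk[i]
            for j in range(len(v)):
                self.S[i][j] += fk[i] * v[j]
        denom = sum(a * b for a, b in zip(fq, self.z))
        return [sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
                for j in range(len(v))]
```

With a single cached token the output reduces exactly to that token's value vector, since numerator and denominator share the same phi(q)·phi(k) factor; the constant per-step state is what makes this attractive for constrained hardware.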
Multimodal Architecture
Unified backbones that natively process vision, language, audio, and action in a single model. Research on early vs. late fusion strategies, cross-modal attention, modality-specific tokenization, and architectures that scale gracefully across input types.
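The early vs. late fusion distinction mentioned above can be shown with a toy example: early fusion runs one backbone over a single mixed-modality token sequence, while late fusion encodes each modality separately and merges only pooled features. Mean-pooling stands in for a real backbone here purely for illustration; the function names are hypothetical.

```python
def mean_pool(tokens):
    """Average a list of equal-length token vectors into one feature vector
    (a stand-in for a real encoder/backbone)."""
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]

def early_fusion(image_tokens, text_tokens):
    # Early fusion: one shared sequence, so a single backbone sees
    # cross-modal context and token counts weight each modality.
    return mean_pool(image_tokens + text_tokens)

def late_fusion(image_tokens, text_tokens):
    # Late fusion: per-modality encoders; features meet only at the end,
    # here merged with a fixed 50/50 average regardless of token counts.
    return [0.5 * (a + b) for a, b in
            zip(mean_pool(image_tokens), mean_pool(text_tokens))]
```

When the modalities contribute different numbers of tokens, the two strategies already disagree in this toy, which hints at why fusion depth and modality-specific tokenization matter for scaling across input types.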
Scalable Architecture
Designs that scale from small research models to production systems. Mixture of Experts (MoE) for conditional computation, expert routing and load balancing, pipeline and tensor parallelism-friendly architectures, and brain-inspired lateralization for asymmetric processing.
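Conditional computation via MoE routing, listed above, can be sketched in a few lines: a learned gate scores experts per input, only the top-k experts run, and their outputs are mixed by renormalized gate probabilities. This is a minimal assumed illustration of top-k routing in general (load balancing, which typically adds an auxiliary loss over expert usage, is omitted); none of these names come from the work on this page.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, gate_weights, experts, k=2):
    """Top-k MoE layer: route the input to its k highest-scoring experts and
    mix their outputs by renormalized gate probabilities. Only k of the
    experts execute, which is the source of the compute savings."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)  # only the selected experts run
        for j in range(len(out)):
            out[j] += (probs[i] / norm) * y[j]
    return out, top

experts = [lambda v: [2 * t for t in v],    # expert 0: scale
           lambda v: [t + 1 for t in v],    # expert 1: shift
           lambda v: [0.0 for t in v]]      # expert 2: drop
gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out, chosen = moe_forward([1.0, 0.0], gate, experts, k=2)
```

Parameter count scales with the number of experts while per-token compute scales only with k, which is exactly the research-to-production scaling property the blurb describes.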
Related Publications
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
Zizhao Hu et al.