Zizhao.md

directions

memory

Agentic Memory

Continual learning of AI agents: in-context learning, continual fine-tuning, and unlearning.

world model

World Model

In-context world models, adaptation to post-training task worlds, and adapting agents in evolving environments.

latency

Low-Latency AI

Efficient attention architectures, KV-cache compression, latent segmentation, and recurrent transformers.

safety

AI Safety

Synthetic data training, risks of multi-agent interaction, post-training guardrails, and AI behavioral study.

recent papers

01 · 2026 · memory · in submission · NeurIPS 2026
SHRED: Document Unlearning via Self-Distillation and Entropy Demotion
A document-level unlearning method that combines self-distillation on retain data with entropy demotion on the forget set. Removes targeted knowledge from LLMs without catastrophic damage to unrelated capabilities.
02 · 2026 · safety · in submission · EMNLP 2026
Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM
Persona effectiveness is task-type dependent: expert prompts consistently improve alignment-dependent tasks (safety, preference) but reliably damage pretraining-dependent knowledge retrieval. PRISM teaches models when to invoke a persona via intent-based self-modeling, preserving accuracy while keeping alignment gains.
03 · 2026 · latency · in progress
AttendTwice: Long-Context Inference via Dynamic Token-Level KV-Cache Selection
A two-pass attention scheme that dynamically selects which KV-cache tokens to attend to per query, enabling long-context inference at a fraction of the standard memory footprint.
04 · 2025 · safety · ACM ICMI 2025
Multimodal Synthetic Data Finetuning and Model Collapse
Studies how vision-language models degrade when fine-tuned on AI-generated multimodal data. Characterizes the collapse dynamics specific to the multimodal regime and proposes mitigation strategies that preserve diversity across modalities.
05 · 2024 · latency · preprint
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion
A brain-inspired MLP architecture with hemispheric lateralization applied to diffusion models. Shows competitive sample quality at reduced parameter count, suggesting structured asymmetry as an inductive bias for generative modeling.
06 · 2024 · latency · preprint
Static Key Attention in Vision
A more efficient attention variant for vision transformers that pre-computes a static key projection, reducing per-token compute while maintaining downstream task performance.

full list on Google Scholar

academic service

Reviewer · NeurIPS 2024–2026
Reviewer · ICLR 2024–2025
Reviewer · ICML 2024–2025
TA · DSCI 552 (USC)
