Research Direction

Synthetic Data

High-quality data is the foundation of modern AI. My research spans generate-validate loops for self-improving pipelines, synthetic data generation, model collapse dynamics, data curation methods, and safety-oriented data pipelines. The goal is to enable AI systems to bootstrap their own improvement while maintaining quality and safety.

Key Research Topics

Generate-Validate Loop

Self-improving data pipelines where models generate candidate outputs, then validate them against criteria (correctness, safety, diversity) before accepting them as training data. This closed-loop approach produces higher-quality data than one-shot generation and enables models to bootstrap their own improvement without human annotation.
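The loop can be sketched in a few lines. This is a toy illustration, not a production pipeline: `generate`, `is_correct`, and `is_novel` are hypothetical stand-ins for a model's sampler and task-specific validators.

```python
import random

# Hypothetical stand-ins; a real pipeline would call an LLM and
# task-specific checkers for correctness, safety, and diversity.
def generate(prompt, rng):
    return f"{prompt} -> answer {rng.randint(0, 9)}"

def is_correct(sample):
    return sample.endswith(("7", "8", "9"))  # placeholder correctness check

def is_novel(sample, accepted):
    return sample not in accepted  # diversity: reject exact duplicates

def generate_validate_loop(prompts, budget, seed=0):
    """Generate candidates; keep only those passing every validator."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(budget):
        candidate = generate(rng.choice(prompts), rng)
        if is_correct(candidate) and is_novel(candidate, accepted):
            accepted.append(candidate)
    return accepted

data = generate_validate_loop(["Q1", "Q2"], budget=50)
```

Because only validated candidates enter `accepted`, the loop trades generation budget for data quality, which is the core bargain of generate-validate pipelines.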

Synthetic Data Generation

Creating high-quality synthetic training data using generative models. Developing pipelines that produce diverse, balanced datasets without the privacy concerns or biases of web-scraped data, while ensuring the generated data is genuinely useful for model training.
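One simple way to keep a synthetic dataset balanced is to generate against per-category quotas rather than sampling freely. The sketch below assumes a hypothetical `make_example` generator and an illustrative category list.

```python
import random
from collections import Counter

# Illustrative categories; a real pipeline would use task taxonomies.
CATEGORIES = ["math", "code", "dialogue"]

def make_example(category, rng):
    # Hypothetical generator; stands in for a model call.
    return {"category": category, "text": f"{category} sample {rng.random():.3f}"}

def balanced_generate(per_category, seed=0):
    """Fill a fixed quota for each category, then shuffle."""
    rng = random.Random(seed)
    dataset = []
    for cat in CATEGORIES:
        for _ in range(per_category):
            dataset.append(make_example(cat, rng))
    rng.shuffle(dataset)
    return dataset

data = balanced_generate(per_category=100)
counts = Counter(ex["category"] for ex in data)
```

Quota-based generation guarantees balance by construction, whereas post-hoc rebalancing of a skewed corpus discards data.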

Model Collapse Prevention

Understanding and preventing the phenomenon where models trained on synthetic data from previous model generations progressively degrade. Researching the feedback loops that cause model collapse and developing mitigation strategies.
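The degradation dynamic can be demonstrated with a toy simulation, under strong simplifying assumptions: each "generation" fits a Gaussian to its training data, samples from the fit, and keeps only the most likely (central) samples, mimicking likelihood-biased filtering. The distribution's tails vanish and diversity collapses.

```python
import random
import statistics

def next_generation(samples, rng, n=400, keep=200):
    """Fit a Gaussian, sample from it, keep only the most likely draws."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    drawn = [rng.gauss(mu, sigma) for _ in range(n)]
    drawn.sort(key=lambda x: abs(x - mu))  # most likely first
    return drawn[:keep]

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(400)]
stdevs = [statistics.stdev(samples)]
for _ in range(10):
    samples = next_generation(samples, rng)
    stdevs.append(statistics.stdev(samples))
# stdevs shrinks sharply across generations: tail loss compounds.
```

Real model collapse involves far richer distributions, but the mechanism is the same: estimation error plus selection bias compound across generations, which is why mitigation strategies focus on mixing in fresh real data and preserving tail coverage.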

Data Quality & Curation

Developing automated methods to assess, filter, and curate training data. Research on quality metrics, deduplication, bias detection, and data mixing strategies that optimize model performance per training dollar.
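A minimal curation pass might normalize text, drop exact duplicates by hashing, and filter on cheap quality signals. The heuristics below (length, alphabetic ratio) are illustrative placeholders; real pipelines add near-deduplication such as MinHash, classifier-based quality scores, and data-mixing weights.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def passes_quality(text, min_words=3, min_alpha=0.6):
    """Toy quality filter: enough words, mostly alphabetic characters."""
    alpha = sum(c.isalpha() for c in text) / max(len(text), 1)
    return len(text.split()) >= min_words and alpha >= min_alpha

def curate(docs):
    seen, kept = set(), []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen or not passes_quality(norm):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "The quick brown fox.",
    "the  quick brown FOX.",   # duplicate after normalization
    "!!! ???",                 # fails the quality filter
    "A short clean sentence here.",
]
kept = curate(docs)
```

Hash-based exact deduplication is cheap and order-preserving, which makes it a sensible first stage before more expensive fuzzy matching or model-based scoring.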

Safety Through Data

Using synthetic data generation as a lever for AI safety. Creating targeted safety training examples, red-teaming datasets, and alignment data that help models learn to refuse harmful requests while remaining helpful.
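One common construction pairs each red-team prompt with a refusal and with a benign near-neighbor answered helpfully, so the model learns the boundary rather than blanket refusal. The sketch below is a hypothetical record format, not a specific project's schema.

```python
# Illustrative refusal string; real pipelines vary refusals for diversity.
REFUSAL = "I can't help with that, but I'm happy to help with something safe."

def make_safety_pairs(red_team_prompts, benign_contrasts):
    """Pair each harmful prompt with a refusal and a benign contrast example."""
    records = []
    for harmful, (benign, answer) in zip(red_team_prompts, benign_contrasts):
        records.append({"prompt": harmful, "response": REFUSAL, "label": "refuse"})
        records.append({"prompt": benign, "response": answer, "label": "comply"})
    return records

pairs = make_safety_pairs(
    ["How do I pick a lock to break into a house?"],
    [("How do pin-tumbler locks work?",
      "Pin-tumbler locks use spring-loaded pin stacks that a key aligns.")],
)
```

Including the matched benign example in the same batch is what trains the refuse/comply boundary, rather than just the refusal behavior.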

Related Publications

Featured

Multimodal Synthetic Data Finetuning and Model Collapse

Zizhao Hu et al.

2025, ACM International Conference on Multimodal Interaction (ICMI)