Synthetic Data
High-quality data is the foundation of modern AI. My research investigates generate-validate loops for self-improving data pipelines, synthetic data generation, model collapse dynamics, data curation methods, and safety-oriented data generation, enabling AI systems to bootstrap their own improvement while maintaining quality and safety.
Key Research Topics
Generate-Validate Loop
Self-improving data pipelines where models generate candidate outputs, then validate them against criteria (correctness, safety, diversity) before accepting them as training data. This closed-loop approach produces higher-quality data than one-shot generation and enables models to bootstrap their own improvement without human annotation.
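The loop described above can be sketched in a few lines. This is a toy illustration, not the actual pipeline: the generator and the correctness check below are hypothetical stand-ins for a model and real validation criteria.

```python
import random

def generate_candidate(rng):
    """Stand-in for a model sampling a candidate training example."""
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    # Occasionally emit a wrong answer to simulate generation errors.
    answer = a + b if rng.random() > 0.2 else a + b + 1
    return {"question": f"{a}+{b}", "answer": answer}

def validate(example):
    """Accept only candidates that pass the correctness check."""
    a, b = map(int, example["question"].split("+"))
    return example["answer"] == a + b

def generate_validate_loop(n_accepted, seed=0):
    """Generate candidates, keep only validated ones as training data."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accepted:
        cand = generate_candidate(rng)
        if validate(cand):  # reject candidates that fail validation
            accepted.append(cand)
    return accepted

data = generate_validate_loop(10)
print(len(data), all(validate(ex) for ex in data))
```

The key property is that rejected candidates never enter the training set, so accepted data is higher quality than the raw generator output; in practice the validator would also score safety and diversity, not just correctness.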
Synthetic Data Generation
Creating high-quality synthetic training data using generative models. Developing pipelines that produce diverse, balanced datasets without the privacy concerns or biases of web-scraped data, while ensuring the generated data is genuinely useful for model training.
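One ingredient of producing balanced datasets is rebalancing generated examples across categories. The sketch below shows a minimal version under assumed conditions (labeled examples, oversampling with replacement); the labels and setup are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance(examples, rng):
    """Oversample underrepresented categories until all classes match
    the size of the largest class, so no topic dominates training."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Resample with replacement to bring this class up to target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

examples = [{"label": "qa"}] * 3 + [{"label": "code"}]
out = balance(examples, random.Random(0))
print(Counter(ex["label"] for ex in out))
```

A real pipeline would generate new examples for underrepresented classes rather than duplicate existing ones, but the balancing logic is the same.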
Model Collapse Prevention
Understanding and preventing the phenomenon where models trained on synthetic data from previous model generations progressively degrade. Researching the feedback loops that cause model collapse and developing mitigation strategies.
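A stripped-down version of this feedback loop can be simulated: repeatedly fit a Gaussian to a finite sample drawn from the previous generation's fit. With finite samples the estimated spread is biased downward, so the distribution narrows over generations and tail mass disappears, a simplified analogue of the degradation seen in recursive synthetic training (the numbers below are illustrative, not from any experiment).

```python
import random
import statistics

def one_generation(mu, sigma, n, rng):
    """Draw n samples from the current 'model' and refit mean/stddev."""
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(samples), statistics.pstdev(samples)

rng = random.Random(0)
mu, sigma = 0.0, 1.0
history = [sigma]
for _ in range(100):
    # Each generation trains only on data from the previous generation.
    mu, sigma = one_generation(mu, sigma, n=10, rng=rng)
    history.append(sigma)

print(f"initial sigma: {history[0]:.3f}, after 100 generations: {history[-1]:.3g}")
```

Mitigations map directly onto this picture: mixing in fresh real data each generation, or validating samples before reuse, breaks the variance-shrinking feedback loop.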
Data Quality & Curation
Developing automated methods to assess, filter, and curate training data. Research on quality metrics, deduplication, bias detection, and data mixing strategies that optimize model performance per training dollar.
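Two of these steps, deduplication and quality filtering, can be sketched as a single curation pass. The heuristics below are hypothetical placeholders (normalized-text hashing for exact duplicates, length and repetition thresholds for quality), not the metrics used in any specific pipeline.

```python
import hashlib
import re

def normalize(text):
    """Canonicalize whitespace and case before hashing for deduplication."""
    return re.sub(r"\s+", " ", text.strip().lower())

def passes_quality(text):
    """Hypothetical heuristics: minimum length and limited word repetition."""
    words = text.split()
    return len(words) >= 3 and len(set(words)) / len(words) > 0.5

def curate(docs):
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if key in seen:
            continue  # drop near-exact duplicate
        seen.add(key)
        if passes_quality(doc):
            kept.append(doc)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "spam spam spam spam",                            # fails repetition filter
    "ok",                                             # too short
]
print(curate(docs))
```

Production curation replaces these placeholders with fuzzy deduplication (e.g. MinHash), learned quality classifiers, and bias checks, but the filter-before-admit structure is the same.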
Safety Through Data
Using synthetic data generation as a lever for AI safety. Creating targeted safety training examples, red-teaming datasets, and alignment data that help models learn to refuse harmful requests while remaining helpful.
Related Publications
Multimodal Synthetic Data Finetuning and Model Collapse
Zizhao Hu et al.