Bengal: Document Unlearning (SHRED)
SHRED is an LLM unlearning method developed under the U.S. government IARPA Bengal program. The goal: surgically remove specific knowledge — such as private IP documents — from a pretrained model, without degrading its unrelated capabilities.
The Problem
Once a large language model has memorized a corpus, naively fine-tuning it to "forget" tends to either leave the knowledge recoverable or damage the model everywhere else. Document-level unlearning has to be both thorough (the forgotten content can't be elicited) and targeted (general capabilities survive).
The Method
SHRED combines two ingredients:
- Self-distillation on the retain set — the model teaches itself to preserve behavior on everything it should still know.
- Entropy demotion on the forget set — targeted knowledge is actively pushed down in the model's output distribution rather than simply masked.
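The two ingredients can be pictured as a two-term objective. What follows is a minimal sketch under stated assumptions, not SHRED's actual loss: it assumes the self-distillation teacher is a frozen copy of the model taken before unlearning, and it stands in for entropy demotion with a simple unlikelihood-style penalty on the forget-set target token. The function names and the `forget_weight` parameter are hypothetical, and the toy logits are plain Python lists rather than real model outputs.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def retain_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) at one token position.

    Assumption: the teacher is a frozen pre-unlearning copy of the model.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def forget_demotion_loss(student_logits, target_id):
    """Stand-in for entropy demotion (assumption): -log(1 - p(target)).

    The penalty grows as the model puts more mass on the token to forget.
    """
    p = math.exp(log_softmax(student_logits)[target_id])
    return -math.log(max(1.0 - p, 1e-12))  # clamp to avoid log(0)

def combined_loss(retain_pairs, forget_items, forget_weight=1.0):
    """Mean retain loss plus weighted mean forget loss.

    retain_pairs: list of (teacher_logits, student_logits)
    forget_items: list of (student_logits, target_id)
    """
    retain = sum(retain_distill_loss(t, s) for t, s in retain_pairs) / len(retain_pairs)
    forget = sum(forget_demotion_loss(s, y) for s, y in forget_items) / len(forget_items)
    return retain + forget_weight * forget

# Toy usage: one retain position, one forget position, vocabulary of size 3.
loss = combined_loss(
    retain_pairs=[([2.0, 0.5, -1.0], [1.8, 0.6, -0.9])],
    forget_items=[([3.0, 0.0, 0.0], 0)],
)
```

In this sketch the retain term is zero when the student matches the teacher exactly, and the forget term shrinks as probability mass moves away from the targeted token, which is the tension the method has to balance.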
Evaluation
SHRED is benchmarked against unlearning baselines such as RMU on standardized forget/retain evaluations (e.g., WMDP), measuring both unlearning effectiveness and retained capability.
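The two measurements can be illustrated with a small scoring helper. This is a hedged sketch, not the project's evaluation harness: it assumes both the forget and retain evaluations are multiple-choice with four options per question, so that forget accuracy near chance (0.25) suggests the knowledge is no longer elicited by that benchmark. The function names and the `num_choices` parameter are hypothetical.

```python
def accuracy(predictions, answers):
    """Fraction of predicted option indices that match the answer key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def unlearning_report(forget_preds, forget_answers,
                      retain_preds, retain_answers, num_choices=4):
    """Summarize unlearning effectiveness and retained capability.

    Assumption: multiple-choice questions with `num_choices` options each.
    """
    chance = 1.0 / num_choices
    forget_acc = accuracy(forget_preds, forget_answers)
    retain_acc = accuracy(retain_preds, retain_answers)
    return {
        "forget_accuracy": forget_acc,               # lower is better, down to chance
        "retain_accuracy": retain_acc,               # higher is better
        "forget_gap_to_chance": forget_acc - chance, # 0.0 means chance-level
    }

# Toy usage with made-up predictions (not real results).
report = unlearning_report(
    forget_preds=[0, 1, 2, 3], forget_answers=[0, 0, 0, 0],
    retain_preds=[1, 1, 2, 3], retain_answers=[1, 1, 2, 0],
)
```

Reporting the gap to chance alongside raw accuracy makes methods easier to compare, since an unlearned model is expected to guess rather than score zero.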
Status
Paper in submission at NeurIPS 2026. PI: Dr. Robin Jia · Co-PI: Dr. Jesse Thomason.