Bengal: Document Unlearning (SHRED)

SHRED is an LLM unlearning method developed under the U.S. government's IARPA Bengal program. The goal is to surgically remove specific knowledge, such as private intellectual property (IP) documents, from a pretrained model without degrading its unrelated capabilities.

The Problem

Once a large language model has memorized a corpus, naively fine-tuning it to "forget" tends to either leave the knowledge recoverable or damage the model everywhere else. Document-level unlearning has to be both thorough (the forgotten content can't be elicited) and targeted (general capabilities survive).

The Method

SHRED combines two ingredients:

Self-distillation on the retain set — the model teaches itself to preserve behavior on everything it should still know.
Entropy demotion on the forget set — targeted knowledge is actively pushed down in the model's output distribution rather than simply masked.
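
To make the two ingredients concrete, here is a minimal sketch of how such a combined objective could be assembled on toy logits. It is not SHRED's actual loss: the text above does not define "entropy demotion", so the forget term here is a stand-in that simply lowers the log-probability of the token to be forgotten. The function names, the forget_weight parameter, and the use of a frozen copy of the original model as the "teacher" are all assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def retain_distill_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position on the retain set.

    Assumption: the teacher is a frozen copy of the original model, so this
    term penalizes any drift in behavior the model should keep.
    """
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))

def forget_demotion_loss(student_logits, target_index):
    """Log-probability of the token to forget (stand-in for entropy demotion).

    Minimizing this value pushes the target token down in the output
    distribution. It is unbounded below, so a real objective would need a
    bound or a stopping criterion.
    """
    return log_softmax(student_logits)[target_index]

def combined_loss(retain_pair, forget_pair, forget_weight=1.0):
    """Weighted sum of the retain and forget terms (hypothetical weighting)."""
    teacher_logits, student_logits = retain_pair
    forget_logits, target_index = forget_pair
    retain_term = retain_distill_loss(teacher_logits, student_logits)
    forget_term = forget_demotion_loss(forget_logits, target_index)
    return retain_term + forget_weight * forget_term

# Toy usage: the student matches the teacher on the retain example, and the
# forget-set target token (index 2) has already been pushed down.
loss = combined_loss(
    retain_pair=([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    forget_pair=([0.0, 0.0, -5.0, 0.0], 2),
)
```

In this sketch the retain term is zero whenever the student reproduces the teacher's distribution, so any remaining gradient comes from the forget term alone. That separation is what lets the two pressures be balanced with a single weight.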

Evaluation

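SHRED is benchmarked against unlearning baselines such as RMU on standardized forget/retain evaluations (e.g., WMDP). These measure both unlearning effectiveness (how little of the forgotten content can still be elicited) and retained capability (how well unrelated abilities survive).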
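
As a rough illustration of a forget/retain evaluation, the sketch below scores a model on two multiple-choice question sets and reports how far forget-set accuracy sits above chance. The exact metrics used for SHRED are not stated here, so the function names and report fields are hypothetical, the default of four answer options reflects WMDP's multiple-choice format, and the predictions in the usage example are made-up toy values, not results.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)

def unlearning_report(forget_preds, forget_answers,
                      retain_preds, retain_answers, num_choices=4):
    """Summarize a forget/retain evaluation (hypothetical report format).

    Good unlearning shows forget accuracy near chance (1 / num_choices)
    while retain accuracy stays close to the original model's level.
    """
    chance = 1.0 / num_choices
    forget_acc = accuracy(forget_preds, forget_answers)
    retain_acc = accuracy(retain_preds, retain_answers)
    return {
        "forget_accuracy": forget_acc,
        "retain_accuracy": retain_acc,
        "forget_gap_to_chance": forget_acc - chance,
    }

# Toy usage with invented answer indices (not real benchmark data).
report = unlearning_report(
    forget_preds=[0, 1, 2, 3], forget_answers=[0, 0, 0, 0],
    retain_preds=[1, 1, 1, 1], retain_answers=[1, 1, 1, 0],
)
```

Reporting the gap to chance rather than raw forget accuracy makes results comparable across benchmarks with different numbers of answer options.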

Status

Paper in submission at NeurIPS 2026. PI: Dr. Robin Jia · Co-PI: Dr. Jesse Thomason.