Bengal: Document Unlearning (SHRED)

Research

SHRED — a document-level LLM unlearning method developed under the U.S. government IARPA Bengal program. It combines self-distillation on the retain set with entropy demotion on the forget set to remove targeted knowledge (e.g., private IP documents) from pretrained LLMs without catastrophic damage to unrelated capabilities.

Unlearning · LLM · AI Safety · Self-Distillation · Research

SHRED is an LLM unlearning method developed under the U.S. government IARPA Bengal program. The goal: surgically remove specific knowledge — such as private IP documents — from a pretrained model, without degrading its unrelated capabilities.

The Problem

Once a large language model has memorized a corpus, naively fine-tuning it to "forget" tends to either leave the knowledge recoverable or damage the model everywhere else. Document-level unlearning has to be both thorough (the forgotten content can't be elicited) and targeted (general capabilities survive).

The Method

SHRED combines two ingredients:

  • Self-distillation on the retain set — the model teaches itself to preserve behavior on everything it should still know.
  • Entropy demotion on the forget set — targeted knowledge is actively pushed down in the model's output distribution rather than simply masked.
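As a rough illustration of how two such terms can be combined into one objective, here is a minimal sketch in plain Python. It is not the SHRED implementation: the exact form of entropy demotion is not given here, so the forget term below is an assumed stand-in that simply penalizes the probability the model assigns to a token from the forget set. The function names and the `forget_weight` parameter are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def retain_loss(teacher_logits, student_logits):
    """Self-distillation term: keep the updated model (student) close to a
    frozen copy of the original model (teacher) on retain-set inputs."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def forget_loss(student_logits, forget_token):
    """ASSUMED stand-in for entropy demotion: the probability assigned to
    the token we want forgotten. Minimizing it pushes that token down in
    the output distribution. Bounded in [0, 1]."""
    return softmax(student_logits)[forget_token]

def combined_loss(retain_pair, forget_pair, forget_weight=1.0):
    """Weighted sum of the two terms for one retain and one forget position.
    forget_weight is a hypothetical trade-off knob."""
    teacher_logits, student_logits = retain_pair
    forget_logits, forget_token = forget_pair
    return (retain_loss(teacher_logits, student_logits)
            + forget_weight * forget_loss(forget_logits, forget_token))
```

In a real training loop, both terms would be averaged over token positions and batches and minimized with an autodiff framework; the sketch only shows the shape of the objective for a single position.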

Evaluation

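SHRED is benchmarked against unlearning baselines such as RMU on standardized forget/retain evaluations (e.g., WMDP). The evaluation measures both unlearning effectiveness (how little of the forgotten content can still be elicited) and retained capability (how well the model still performs on unrelated tasks).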
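The two measurements can be illustrated with a small scoring sketch. WMDP questions are multiple-choice with four options, so a model that has truly lost the targeted knowledge should score near the 25% chance level on forget questions while keeping its accuracy on a separate retain benchmark. The function names and report fields below are hypothetical, and the metrics SHRED actually reports are not stated here.

```python
def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions that match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

def unlearning_report(forget_preds, forget_answers,
                      retain_preds, retain_answers, num_choices=4):
    """Summarize forget-set and retain-set accuracy against chance.
    num_choices=4 matches four-option multiple-choice questions."""
    chance = 1.0 / num_choices
    forget_acc = accuracy(forget_preds, forget_answers)
    retain_acc = accuracy(retain_preds, retain_answers)
    return {
        "forget_accuracy": forget_acc,   # lower (near chance) is better
        "retain_accuracy": retain_acc,   # higher is better
        "chance": chance,
        "forget_gap_to_chance": forget_acc - chance,
    }
```

Comparing this report for the original model, a baseline, and the unlearned model shows whether a drop in forget accuracy came at the cost of retain accuracy.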

Status

Paper under submission to NeurIPS 2026. PI: Dr. Robin Jia · Co-PI: Dr. Jesse Thomason.