Project Orion

Completed

Advanced MOVE Fellowship phase (Nov 2025) — specialized refinement of frontier AI models. Focused on high-quality reasoning chains, safety injections, and red-teaming through jailbreak testing. One-month intensive following Project Canary.

AI SafetyReasoningRed-TeamingHandshake AIMOVE Fellowship

Project Orion — Advanced Model Refinement

The advanced phase of the MOVE Fellowship at Handshake AI (Nov 2025), shifting from broad data generation to specialized model refinement.

From Canary to Orion

While Project Canary focused on volume and coverage across 15 domains, Project Orion narrowed the focus to three critical areas:

  1. Reasoning refinement — improving how models think through complex problems
  2. Safety injections — embedding guardrails directly into model behavior
  3. Red-teaming — finding and addressing vulnerabilities through jailbreak testing

Reasoning Refinement

Chain-of-Thought Improvement

Not just getting the right answer, but ensuring the model's reasoning process is sound:

  • Identifying cases where models reach correct answers through flawed logic
  • Rewiring reasoning chains to be logically sound, step-by-step
  • Creating training examples that demonstrate expert-level problem decomposition

Quality Over Previous Phase

Where Canary tasks tested "can you solve this?", Orion tasks tested "can you solve this correctly, for the right reasons, showing your work?"

Safety Injections

Embedding Guardrails

Safety isn't a filter bolted on after training — it needs to be woven into the model's core behavior:

  • Creating scenarios where the model must recognize and refuse harmful requests
  • Building training data for graceful refusals that explain why something is problematic
  • Edge cases: requests that seem benign but could enable harm

The Subtlety Problem

The hardest cases aren't obvious "how do I build a bomb" requests. They're:

  • Multi-step requests where each step seems innocent
  • Context-dependent requests that require judgment
  • Dual-use knowledge that has both legitimate and harmful applications

Red-Teaming & Jailbreak Testing

Finding Vulnerabilities

Systematically probing the model for failure modes:

  • Prompt injection: Embedding instructions that override system prompts
  • Role-play attacks: Getting models to "play a character" that bypasses safety
  • Encoding tricks: Using base64, rot13, or other encodings to sneak past filters
  • Multi-turn manipulation: Slowly escalating across a conversation

Why This Matters

Every vulnerability found in testing is one that won't be exploited in production. Red-teaming is the immune system of AI safety.

Impact

Building on Canary's foundation, Orion produced targeted, high-impact training data that directly improved model reasoning quality and safety behavior. The one-month intensive demonstrated that focused expert refinement can achieve more than months of broad data generation.