Project Orion — Advanced Model Refinement

The advanced phase of the MOVE Fellowship at Handshake AI (Nov 2025), shifting from broad data generation to specialized model refinement.

From Canary to Orion

While Project Canary focused on volume and coverage across 15 domains, Project Orion narrowed the focus to three critical areas:

Reasoning refinement — improving how models think through complex problems
Safety injections — embedding guardrails directly into model behavior
Red-teaming — finding and addressing vulnerabilities through jailbreak testing

Reasoning Refinement

Chain-of-Thought Improvement

Not just getting the right answer, but ensuring the model's reasoning process is sound:

Identifying cases where models reach correct answers through flawed logic
Rewiring reasoning chains to be logically sound, step-by-step
Creating training examples that demonstrate expert-level problem decomposition

Quality Over Previous Phase

Where Canary tasks tested "can you solve this?", Orion tasks tested "can you solve this correctly, for the right reasons, showing your work?"

Safety Injections

Embedding Guardrails

Safety isn't a filter bolted on after training — it needs to be woven into the model's core behavior:

Creating scenarios where the model must recognize and refuse harmful requests
Building training data for graceful refusals that explain why something is problematic
Edge cases: requests that seem benign but could enable harm

The Subtlety Problem

The hardest cases aren't obvious "how do I build a bomb" requests. They're:

Multi-step requests where each step seems innocent
Context-dependent requests that require judgment
Dual-use knowledge that has both legitimate and harmful applications

Red-Teaming & Jailbreak Testing

Finding Vulnerabilities

Systematically probing the model for failure modes:

Prompt injection: Embedding instructions that override system prompts
Role-play attacks: Getting models to "play a character" that bypasses safety
Encoding tricks: Using base64, rot13, or other encodings to sneak past filters
Multi-turn manipulation: Slowly escalating across a conversation

Why This Matters

Every vulnerability found in testing is one that won't be exploited in production. Red-teaming is the immune system of AI safety.

Impact

Building on Canary's foundation, Orion produced targeted, high-impact training data that directly improved model reasoning quality and safety behavior. The one-month intensive demonstrated that focused expert refinement can achieve more than months of broad data generation.