The Interview Is Dead: What AI Evaluation Teaches Us About Hiring Humans

The Problem: We've built a sophisticated evaluation culture for AI (benchmarks like MMLU, HumanEval, and SWE-bench) that actively drives model development, yet we still evaluate humans with whiteboard puzzles straight out of the 1990s and LeetCode trivia. These tests measure memorization, not real-world capability.
The Idea: AI evaluation works because it tests real capabilities, measures end-to-end output quality, reflects actual use cases, and evolves over time. Traditional human interviews fail on every one of these criteria. The irony: we already know how to build good evaluations; we just haven't applied that knowledge to evaluating humans.
My Solution: Replace traditional interviews with deliverable-based collaboration interviews. Give candidates a real-world problem, let them use any AI tools they want (ChatGPT, Copilot, Cursor), and evaluate both the final artifact (60%) and their collaboration process (40%): how they decompose problems, guide the AI, catch hallucinations, and make judgment calls.
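To make the 60/40 weighting concrete, here is a minimal scoring sketch. The specific sub-criteria and field names are my own illustrative assumptions, not a prescribed rubric; the only thing taken from the proposal above is the split between artifact quality and collaboration process.

```python
from dataclasses import dataclass

@dataclass
class InterviewScores:
    # All scores are on a 0-1 scale; the sub-criteria are hypothetical examples.
    artifact_quality: float        # quality of the final deliverable
    problem_decomposition: float   # how well the candidate broke the task down
    ai_guidance: float             # how effectively they directed the AI tools
    hallucination_catching: float  # AI errors they noticed and corrected
    judgment_calls: float          # tradeoffs and decisions made along the way

def overall_score(s: InterviewScores) -> float:
    """Weighted score: 60% final artifact, 40% collaboration process."""
    process = (s.problem_decomposition + s.ai_guidance +
               s.hallucination_catching + s.judgment_calls) / 4
    return 0.6 * s.artifact_quality + 0.4 * process

# Example: strong artifact, mixed process -> 0.6*0.9 + 0.4*0.75 = 0.84
print(overall_score(InterviewScores(0.9, 0.8, 0.7, 0.6, 0.9)))
```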
The Vision: Evaluation and capability co-evolve. If we start measuring what actually matters — collaboration, judgment, end-to-end delivery — we'll create a culture that produces better engineers. The interview isn't just a filter; it's a signal to the entire industry about what we value.
Zizhao Hu
PhD Student at USC · AI Researcher