The Evaluation Problem — How to Tell Whether an AI Actually Understood the Assignment
Most AI evaluation rewards polished output. But the deeper question is simpler: did the system understand the assignment, or did it only produce something that looked close enough to pass?
Ask an AI to write a function and it will produce something polished, well-commented, and structurally clean. It might even include edge cases you didn’t mention. It looks like the system understood the assignment. But “looks like” and “did” are not the same thing — and the gap between them is where most AI evaluation falls apart.
The question we keep asking — “Is this model smart?” — is the wrong question. The better question is: did it read the task correctly?
The Wrong Scoreboard
Most AI benchmarks measure output quality. Can the model solve a coding problem? Can it pass a test? Can it generate text that sounds coherent? These are valid questions, but they’re measuring the wrong layer. They tell you whether the system produced something that looks right. They don’t tell you whether it understood what “right” means.
This is the same problem that plagues hiring. A candidate with a polished portfolio and smooth interview answers gets the offer. Six months later, you discover they can produce impressive deliverables when the brief is clear, but fall apart when requirements are ambiguous, contradictory, or spread across multiple documents. The interview tested presentation. It should have tested interpretation.
What Ambiguity Reveals
The most informative thing you can do when evaluating an AI system is give it an ambiguous specification and watch what happens.
When a system encounters ambiguity, one of several things happens; a concrete sketch follows the list:
- It detects the ambiguity and flags it — the best outcome, and the rarest
- It silently resolves the ambiguity by picking one interpretation — common, and dangerous when it picks wrong
- It ignores the ambiguous constraint entirely — producing output that works for the parts it understood and fails for the parts it didn’t
- It hallucinates a resolution — inventing a requirement that wasn’t in the spec
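To make the silent-resolution case concrete, here is a minimal sketch. The spec, record shape, and function names are hypothetical; the point is that one ambiguous sentence admits two implementations that both look finished:

```python
# Hypothetical spec: "Remove duplicate user records (same email)."
# The spec never says which record to keep. Two silent resolutions:

def dedupe_keep_first(records):
    """Keep the first record seen for each email."""
    seen = {}
    for rec in records:
        seen.setdefault(rec["email"], rec)
    return list(seen.values())

def dedupe_keep_last(records):
    """Keep the last record seen for each email."""
    seen = {}
    for rec in records:
        seen[rec["email"]] = rec
    return list(seen.values())

records = [
    {"email": "a@example.com", "plan": "free"},
    {"email": "a@example.com", "plan": "pro"},  # same user, later update
]

print(dedupe_keep_first(records))  # [{'email': 'a@example.com', 'plan': 'free'}]
print(dedupe_keep_last(records))   # [{'email': 'a@example.com', 'plan': 'pro'}]
```

Both versions pass an output-level check like “no duplicate emails remain.” Only an interpretation-level check notices that the system guessed, and guessed silently.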
Observable Worker Roles
When you study AI behavior across many tasks, patterns emerge. Different types of problems activate different operational modes:
Some tasks are pure pattern matching — the system recognizes a structure it’s seen before and reproduces it. This is where AI shines. Boilerplate code, standard data transformations, well-documented API integrations. The system is essentially recalling and adapting known solutions.
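For calibration, this is the kind of task pattern matching handles almost flawlessly. A minimal sketch (the file path and function name are illustrative):

```python
import csv

def load_rows(path):
    """Read a CSV file into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

Variations of this appear thousands of times in training data, so the system is effectively recalling a known solution rather than reasoning one out.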
Other tasks require semantic translation — converting a natural language specification into a precise technical implementation. This is harder, because natural language is inherently ambiguous.
The most demanding tasks require consequence reasoning — understanding not just what the code does, but what happens downstream. Most AI systems are excellent at pattern matching, competent at semantic translation, and unreliable at consequence reasoning.
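The gap shows up in code like the following sketch, where fetch_price stands in for a hypothetical remote call. The first version returns correct values and passes every unit test; it also grows without bound in a long-running service, because nothing is ever evicted. Noticing that second half is consequence reasoning:

```python
import functools

def fetch_price(product_id):
    """Stand-in for a remote pricing call (hypothetical)."""
    return 9.99

# Locally correct, downstream hazardous: an unbounded module-level cache.
_cache = {}

def lookup_price(product_id):
    if product_id not in _cache:
        _cache[product_id] = fetch_price(product_id)
    return _cache[product_id]

# What a consequence-aware implementation reaches for: a bounded cache.
@functools.lru_cache(maxsize=10_000)
def lookup_price_bounded(product_id):
    return fetch_price(product_id)
```

Nothing in the unbounded version is wrong at the pattern-matching or translation layer. The defect only exists downstream, which is exactly where evaluation rarely looks.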
Predictable Failure Signatures
Specific categories of problems consistently trip up even the most capable models; a sketch after the list demonstrates two of them:
- Floating-point discipline — naive binary-float arithmetic in financial calculations, scientific computing, or any other precision-sensitive domain silently produces wrong results
- Boundary conditions — does “between 10 and 20” include 10? The AI picks whichever reading feels more natural, without verifying it against the spec
- Reference semantics — copy versus alias, value versus reference; the AI defaults to whichever pattern it has seen most often, regardless of the semantics of the language it’s writing in
- Transaction atomicity — the spec said atomic; the AI built a loop of individual writes that can fail halfway through
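Two of these signatures fit in a few lines. A minimal sketch of the floating-point and boundary traps:

```python
from decimal import Decimal

# Floating-point discipline: binary floats cannot represent most decimal
# fractions exactly, so cent-level errors accumulate silently.
print(0.1 + 0.2 == 0.3)                            # False
print(sum([0.1] * 10) == 1.0)                      # False
print(sum([Decimal("0.1")] * 10) == Decimal("1"))  # True

# Boundary conditions: "between 10 and 20" has several defensible readings.
def in_range_inclusive(x):   # [10, 20]
    return 10 <= x <= 20

def in_range_exclusive(x):   # (10, 20)
    return 10 < x < 20

print(in_range_inclusive(10), in_range_exclusive(10))  # True False
```

A polished-looking function can embed either reading of the boundary and either treatment of the floats. Output inspection will not catch it; only checking the implementation against the intended constraint will.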
Better Questions, Better Systems
The shift from “Is AI smart?” to “Does AI interpret correctly?” changes everything downstream. It changes how we build benchmarks — from clean specifications to deliberately ambiguous ones. It changes how we evaluate models — from output appearance to constraint adherence. It changes how we use AI — from trusting polished output to verifying interpretation.
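What would an interpretation-first benchmark look like? A toy sketch, assuming tasks with planted ambiguities and a reviewer labeling each response against the four outcomes listed earlier (the outcome names and scores are illustrative, not an established benchmark):

```python
# Score interpretation, not polish: flagging the ambiguity beats guessing.
OUTCOME_SCORES = {
    "flagged_ambiguity": 1.0,    # asked, or stated its assumption explicitly
    "silent_resolution": 0.25,   # produced working output, but guessed
    "ignored_constraint": 0.0,
    "invented_requirement": 0.0,
}

def score_run(labeled_outcomes):
    """Average interpretation score over a batch of hand-labeled responses."""
    return sum(OUTCOME_SCORES[label] for label in labeled_outcomes) / len(labeled_outcomes)

# Example: three ambiguous tasks, labeled by a reviewer.
print(score_run(["flagged_ambiguity", "silent_resolution", "ignored_constraint"]))
# 0.4166666666666667
```

The scoring rule is crude, but it points the optimization pressure at the right layer: a model that says “the spec doesn’t specify which duplicate to keep” outscores one that silently ships a guess.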
And it mirrors a principle that applies far beyond AI: the quality of a system is determined by the quality of its evaluation. When you test the wrong thing, you optimize for the wrong thing. When you ask better questions, you get better systems.
Sixth in a series examining the real problems people face with AI — and what happens when you start measuring the right thing.
