dot.awesome Dev Journal · HUMAN.EXE · ARCHITECT SERIES
Research · 7 min read
The Evaluation Problem — How to Tell Whether an AI Is Actually Following the Problem

Most AI evaluation rewards polished output. But the deeper question is simpler: did the system understand the assignment, or did it only produce something that looked close enough to pass?

dot.awesome · March 18, 2026

Ask an AI to write a function and it will produce something polished, well-commented, and structurally clean. It might even include edge cases you didn’t mention. It looks like the system understood the assignment. But “looks like” and “did” are not the same thing — and the gap between them is where most AI evaluation falls apart.

The question we keep asking — “Is this model smart?” — is the wrong question. The better question is: did it read the task correctly?

The Wrong Scoreboard

Most AI benchmarks measure output quality. Can the model solve a coding problem? Can it pass a test? Can it generate text that sounds coherent? These are valid questions, but they’re measuring the wrong layer. They tell you whether the system produced something that looks right. They don’t tell you whether it understood what right means.

This is the same problem that plagues hiring. A candidate with a polished portfolio and smooth interview answers gets the offer. Six months later, you discover they can produce impressive deliverables when the brief is clear, but fall apart when requirements are ambiguous, contradictory, or spread across multiple documents. The interview tested presentation. It should have tested interpretation.

What Ambiguity Reveals

The most informative thing you can do when evaluating an AI system is give it an ambiguous specification and watch what happens.

When a system encounters ambiguity, one of several things happens:

  • It detects the ambiguity and flags it — the best outcome, and the rarest
  • It silently resolves the ambiguity by picking one interpretation — common, and dangerous when it picks wrong
  • It ignores the ambiguous constraint entirely — producing output that works for the parts it understood and fails for the parts it didn’t
  • It hallucinates a resolution — inventing a requirement that wasn’t in the spec
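
The "silently resolves" failure is easy to demonstrate concretely. Below is a toy sketch (the spec, function names, and data are illustrative, not from any real benchmark): the spec "remove duplicates from a list" never says whether input order must be preserved, and a lenient check accepts either interpretation while an order-sensitive check exposes which one the implementer silently picked.

```python
def dedupe_as_set(items):
    # Silently resolves the ambiguity: drops duplicates AND input order.
    return list(set(items))

def dedupe_preserving_order(items):
    # The other interpretation: first occurrence wins, order preserved.
    return list(dict.fromkeys(items))

data = [3, 1, 3, 2, 1]

# A lenient check ("the output looks right") accepts both interpretations:
assert set(dedupe_as_set(data)) == {1, 2, 3}
assert set(dedupe_preserving_order(data)) == {1, 2, 3}

# An order-sensitive check exposes the hidden choice:
print(dedupe_preserving_order(data))  # [3, 1, 2]
print(sorted(dedupe_as_set(data)))    # set() iteration order is not guaranteed
```

Both functions pass the lenient check, which is exactly the point: an evaluation that only inspects the output's contents cannot see which interpretation was chosen.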

Observable Worker Roles

When you study AI behavior across many tasks, patterns emerge. Different types of problems activate different operational modes:

Some tasks are pure pattern matching — the system recognizes a structure it’s seen before and reproduces it. This is where AI shines. Boilerplate code, standard data transformations, well-documented API integrations. The system is essentially recalling and adapting known solutions.

Other tasks require semantic translation — converting a natural language specification into a precise technical implementation. This is harder, because natural language is inherently ambiguous.

The most demanding tasks require consequence reasoning — understanding not just what the code does, but what happens downstream. Most AI systems are excellent at pattern matching, competent at semantic translation, and unreliable at consequence reasoning.

Predictable Failure Signatures

Specific categories of problems consistently trip up even the most capable models:

  • Floating-point discipline — naive float arithmetic silently produces wrong results in financial calculations, scientific computing, and any domain where exact precision matters
  • Boundary conditions — does “between 10 and 20” include 10? The AI picks whichever interpretation feels more natural without verifying
  • Reference semantics — does assignment copy a value or alias it? The AI defaults to whichever pattern it has seen most often, regardless of which language it’s writing in
  • Transaction atomicity — the spec said atomic. The AI built a loop of independent writes that can fail halfway through
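
The first two signatures fit in a few lines of Python (function names here are illustrative):

```python
from decimal import Decimal

# Floating-point discipline: binary floats cannot represent 0.1 exactly,
# so naive money arithmetic drifts without raising any error.
assert 0.1 + 0.2 != 0.3                                    # silent wrongness
assert Decimal("0.1") + Decimal("0.2") == Decimal("0.3")   # exact arithmetic

# Boundary conditions: two readings of "between 10 and 20".
def between_inclusive(x):
    return 10 <= x <= 20

def between_exclusive(x):
    return 10 < x < 20

# Both are plausible implementations of the same English sentence;
# only the spec can say which is correct.
assert between_inclusive(10) and not between_exclusive(10)
```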

Better Questions, Better Systems

The shift from “Is AI smart?” to “Does AI interpret correctly?” changes everything downstream. It changes how we build benchmarks — from clean specifications to deliberately ambiguous ones. It changes how we evaluate models — from output appearance to constraint adherence. It changes how we use AI — from trusting polished output to verifying interpretation.
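
One way to sketch that shift from appearance to adherence (every name below is hypothetical, not an existing harness): encode the spec's requirements as machine-checkable predicates, then grade a submission by which constraints hold rather than by how polished it looks.

```python
def check_constraints(fn, constraints):
    """Return the descriptions of every constraint the candidate violates."""
    return [desc for desc, pred in constraints if not pred(fn)]

# Spec: "clamp(x) keeps x within [0, 100], endpoints inclusive."
constraints = [
    ("includes lower bound", lambda f: f(0) == 0),
    ("includes upper bound", lambda f: f(100) == 100),
    ("clamps below range",   lambda f: f(-5) >= 0),
    ("clamps above range",   lambda f: f(250) <= 100),
]

def candidate(x):
    # A plausible-looking submission that silently excludes the endpoints.
    return min(99, max(1, x))

print(check_constraints(candidate, constraints))
# → ['includes lower bound', 'includes upper bound']
```

The candidate is clean, readable, and almost right; an appearance-based grade would pass it, while the constraint check names exactly which parts of the spec it reinterpreted.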

And it mirrors a principle that applies far beyond AI: the quality of a system is determined by the quality of its evaluation. When you test the wrong thing, you optimize for the wrong thing. When you ask better questions, you get better systems.

Sixth in a series examining the real problems people face with AI — and what happens when you start measuring the right thing.

benchmarking · ai-evaluation · cognitive-testing · worker-roles
ARCHITECT SERIES

You’re reading 6 of 8.

Get notified when the next article drops. No marketing — one email per new article, unsubscribe any time.

NEXT IN SERIES · 7 of 8
The Stability Problem — Why a Useful AI System Has to Be Stable Before It Looks Intelligent
An impressive result is easy to overvalue. The harder question is whether the system stays grounded, calibrated, and recoverable when conditions stop being ideal.
Continue reading →
← Previous
The Continuity Problem
Next →
The Stability Problem