
AGI R&D Division · AGI-R-01

Agent Cognitive Benchmark

v1.1 · 53 tests · 130 points · 13 cognitive trap categories

A standardised test for measuring AI agent code-generation competence. Every model receives the same prompt, the same constraints, and the same specification — scored against a deterministic answer key across 13 categories of known AI failure patterns.

HOW IT WORKS

PROMPT (seed.md) → AGENT (approved model) → OUTPUT (TypeScript module) → SCORE (0–130)
  1. The agent receives seed.md — a self-contained natural language specification
  2. The agent produces a TypeScript module in a single pass — no hints, no clarifications
  3. The harness runs 53 deterministic test cases against the output
  4. Score is recorded with wall-clock time on the scoreboard
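
A minimal sketch of the scoring loop such a harness implies (the TestCase shape, per-test point weights, and module interface here are illustrative assumptions, not the published ACB harness; only sortByValue() is named by the category table below):

```typescript
// Illustrative harness sketch. Test-case shape and weights are assumptions.
import assert from "node:assert/strict";

interface TestCase {
  id: string;                              // e.g. "T7-01"
  points: number;                          // weight toward the 130-point total
  run: (mod: Record<string, any>) => void; // throws on failure
}

// Hypothetical test case for trap T7 (immutable sort).
const t7ImmutableSort: TestCase = {
  id: "T7-01",
  points: 2, // assumed weight
  run: (mod) => {
    const input = [3, 1, 2];
    const snapshot = [...input];
    mod.sortByValue(input);               // must NOT mutate its argument
    assert.deepEqual(input, snapshot);
  },
};

function scoreModule(mod: Record<string, any>, cases: TestCase[]) {
  const started = Date.now();
  let earned = 0;
  const failures: string[] = [];
  for (const tc of cases) {
    try {
      tc.run(mod);                        // deterministic answer-key check
      earned += tc.points;
    } catch {
      failures.push(tc.id);
    }
  }
  // Wall-clock time is recorded alongside the score (step 4 above).
  return { earned, failures, wallClockMs: Date.now() - started };
}
```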

13 COGNITIVE TRAP CATEGORIES

# | Category | What It Tests | Pts
T1 | Merge Semantics | Additive vs. overwrite — compound merge rules under conflicting inputs | 12
T2 | Return Value Precision | Returning actual vs. requested values when constraints apply | 8
T3 | Floating Point Discipline | 0.1 + 0.2 ≠ 0.3 — proper rounding and accumulation handling | 10
T4 | Conditional Replacement | "Replace only if greater" — short-circuit logic correctness | 8
T5 | Inclusive Boundaries | min ≤ price ≤ max — off-by-one on range queries | 8
T6 | Deep Clone Integrity | True deep copy of nested structures — not shallow spread | 10
T7 | Immutable Sort | sortByValue() must not mutate the original array | 10
T8 | Atomic Batch Semantics | All-or-nothing updates — partial failure requires full rollback | 12
T9 | Report Formatting | String output matches exact expected format | 10
T10 | Edge Cases | Empty state, zero quantity, negative discount, duplicate entries | 12
T11 | Async Concurrency | Promise.allSettled semantics — null/rejection skipping; merge delegation | 12
T12 | Generic Accessor | Typed field retrieval with correct undefined-on-miss behaviour | 8
T13 | Partition Semantics | Predicate split into two independent deep-copy instances | 10
Total | 130
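
To make a few of these traps concrete, here is a hedged sketch of passing versus failing patterns for T3, T7, T8, and T11 (sortByValue is named in the table above; all other names are illustrative):

```typescript
// T3 (Floating Point Discipline): naive accumulation drifts.
const naive = 0.1 + 0.2;                       // 0.30000000000000004
const rounded = Math.round(naive * 100) / 100; // 0.3: round at the output boundary

// T7 (Immutable Sort): Array.prototype.sort mutates in place.
function sortByValueWrong(xs: number[]): number[] {
  return xs.sort((a, b) => a - b);             // mutates the caller's array: fails
}
function sortByValueRight(xs: number[]): number[] {
  return [...xs].sort((a, b) => a - b);        // copy first, then sort: passes
}

// T8 (Atomic Batch Semantics): apply all updates or none.
function applyBatch(state: Map<string, number>, updates: [string, number][]): void {
  const backup = new Map(state);               // snapshot for rollback
  try {
    for (const [key, qty] of updates) {
      if (qty < 0) throw new Error(`invalid quantity for ${key}`);
      state.set(key, qty);
    }
  } catch (err) {
    state.clear();
    for (const [k, v] of backup) state.set(k, v); // full rollback on partial failure
    throw err;
  }
}

// T11 (Async Concurrency): Promise.allSettled lets rejected or null
// lookups be skipped instead of failing the whole batch.
async function resolveAll(lookups: Promise<number | null>[]): Promise<number[]> {
  const settled = await Promise.allSettled(lookups);
  return settled
    .filter((r): r is PromiseFulfilledResult<number | null> => r.status === "fulfilled")
    .map((r) => r.value)
    .filter((v): v is number => v !== null);   // nulls skipped, per the trap description
}
```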

SCOREBOARD — v1.0 RESULTS

100 points · 10 categories (T1–T10) · Single-shot protocol

# | Model | Score | Grade
1 | Claude 4 Opus | 100/100 | S
2 | Gemini 3 Flash (preview) | 100/100 | S
3 | Claude Sonnet 4 | DNF | –
4 | Gemini 2.5 | DNF | –

DNF — Did Not Finish. Models that requested clarification before generating output are disqualified under protocol rules.

SCOREBOARD — v1.1

130 points · 13 categories (T1–T13) · Clean-room protocol · Live updates

Loading live data…

New Track — April 2026

ACB-AGENTIC — CONTEXT INVERSION STUDY

Updated: April 17, 2026

A parallel study ran the ACB prompt inside a VS Code agentic interface rather than via clean-room API. The results revealed a categorical difference between the two delivery surfaces.

KEY FINDING — Agentic Overhead = 130 Points
Claude Sonnet 4.6: 130/130 S-grade via clean-room API — DNF via agentic interface. Same model, same date, same workspace. The delivery method alone determined the outcome.
Clean-room API: 130/130 S — Exceptional
Agentic VS Code: DNF (never produced code)
Delta: 130 pts

7 BEHAVIORAL CLASSES OBSERVED — 14 MODELS

DNF-CI-E · 2 models
Context Inversion (elaborate)
Multi-step workspace exploration, reasoning escalation, permission negotiation. Never produces code. Confirmed across two study conditions: GPT 5.4 (original, ~5–6 search steps); GPT 5.4 XHigh (direct-file-context, read 3 files + offered 3 follow-up options).
DNF-CI-M · 4 models
Immediate stop (4 models)
Single workspace search, found nothing, clean clarification request to user. Fast stop. Claude Sonnet 4.6, Grok Code Fast 1, Claude Haiku 4.5, GPT 4o.
DNF-CI-F · 5 models
Context Inversion (fabrication)
Model treats non-implementation output as a completed deliverable. Variants: (1) blank file creation — created empty file to satisfy surface request without reading spec (Raptor Mini); (2) summary-as-deliverable — read spec, produced structured description, stopped (Opus R2, Codex R2, Haiku R2); (3) echo — printed raw file contents verbatim, no response (Grok R2). Confirmed across 5 models in 2 study conditions.
AGN-PASS · 1 model
Agentic Pass (1 confirmed)
Navigated context gap via workspace README, extracted external seed path, passed permission gate, read spec, produced full implementation. Claude Opus 4.6 — first confirmed. All 13 methods correct including T11–T13.
DNF-CI-B · 3 models
Bridge attempt (2 pending re-run)
Used workspace README as a map to locate the external spec. Standard variant: attempted permission gate (expired — Gemini 3.1 Pro run 1, GPT 5.4 XHigh). Pre-gate ask sub-variant: bridge complete, external path identified in CoT, model escalated to user before any gate attempt — GPT-5.4 mini-medium (first confirmed). Gemini 3.1 Pro showed intra-model variance: run 1 = bridge (gate expired), run 2 = DNF-CI-M.
DNF-CI-OA · 1 model
Over-Automation (1 confirmed)
Bridge fully succeeded, spec read via operator-granted gate — model then inferred a maintenance context from prior workspace results. Made unauthorized workspace modifications: created local seed copy, read prior result files, modified existing implementation (partition() patch), deleted temp file. No fresh submission produced. User stopped the session. Workspace damage. GPT-5.4 Mini-XHigh run 3 — first confirmed.
DNF-CI-A · 6 models
Analysis Capture (6 confirmed)
Model receives spec and produces analysis, description, or clarification questions instead of code. Original study (gate-required): confirmed across Gemini 3 Flash, GPT 5.3-Codex XHigh, Claude Sonnet 4.5, GPT 5.4 XHigh — 4/4 gate-granted → DNF-CI-A (excl. Opus). Direct-file-context study: confirmed across Claude Sonnet 4.6 High (3 intent questions) and Gemini 3.1 Pro Preview (1 clarification question). Gate access and direct spec access both fail to trigger implementation.
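
For readers tracking the run table below, the taxonomy can be written down as a small discriminated type; this encoding is a descriptive convenience for illustration, not the study's internal schema:

```typescript
// Descriptive encoding of the observed behavioral classes; field names
// are assumptions, not an artifact of the study.
type BehavioralClass =
  | "DNF-CI-E"   // context inversion, elaborate exploration, no code
  | "DNF-CI-M"   // immediate stop: one search, then a clarification request
  | "DNF-CI-F"   // fabrication: non-implementation output treated as deliverable
  | "DNF-CI-B"   // bridge attempt: README used to locate the external spec
  | "DNF-CI-OA"  // over-automation: unauthorized workspace modifications
  | "DNF-CI-A"   // analysis capture: spec received, analysis produced instead of code
  | "AGN-PASS";  // full agentic pass: spec located, read, and implemented

interface RunRecord {
  model: string;
  class: BehavioralClass;
  steps: string;       // e.g. "~5–6" or "R1: 3 / R2: 1"
  outcome: string;
}
```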

ALL 14 MODELS · 21 RUNS

Model | Class | Steps | Outcome
GPT 5.4 | DNF-CI-E | ~5–6 | DNF
Claude Sonnet 4.6 | DNF-CI-M | 1 | DNF
Grok Code Fast 1 | DNF-CI-M | 1 | DNF
Claude Haiku 4.5 | DNF-CI-M | 1 | DNF
GPT 4o | DNF-CI-M | 1 | DNF
Raptor Mini (Preview) | DNF-CI-F | 1 | DNF
Claude Opus 4.6 | AGN-PASS | 4+ | AGN-PASS — first confirmed
Gemini 3 Flash (Preview) | DNF-CI-A | 3+ | DNF-CI-A
GPT 5.3-Codex XHigh | DNF-CI-A | ~6 | DNF-CI-A
Gemini 3.1 Pro (Preview) | Bridge attempt | R1: 3 / R2: 1 | R1: gate expired — re-run / R2: DNF-CI-M (intra-model variance)
Claude Sonnet 4.5 | DNF-CI-A | 4+ | DNF-CI-A — third confirmed; first Claude lineage
GPT-5.4 mini-medium | DNF-CI-B (pre-gate ask) | ~3 | DNF-CI-B — pre-gate ask; bridge complete, user escalation before gate
GPT-5.4 Mini (XHigh) | DNF-CI-OA | 5+ / 1 / 10+ | Runs 1+2: gate expired. Run 3: DNF-CI-OA — modified drop/inventory.ts; user stopped
GPT 5.4 XHigh | DNF-CI-A (run 3) | ~8 | DNF-CI-A — 2 gates granted (terminal probe + file read); full spec displayed + file copy offer. Fourth confirmed.
Claude Sonnet 4.6 High | DNF-CI-A | 0 | DNF-CI-A — spec was open file; 3 intent questions, no implementation attempt
Claude Opus 4.6 (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; summary as deliverable. Intra-model variance: run 1 = AGN-PASS
GPT 5.4 XHigh (direct) | DNF-CI-E | ~3 | DNF-CI-E — read seed.md + seed-v1.0.md + README; offered 3 follow-up options; no code
Gemini 3.1 Pro Preview (direct) | DNF-CI-A | 0 | DNF-CI-A — spec was open file; single clarification question; no implementation
GPT-5.3-Codex (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; structured summary + permission gate; no code
Grok Code Fast 1 (direct) | DNF-CI-F | 0 | DNF-CI-F — verbatim file echo; no response (echo variant)
Claude Haiku 4.5 (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; structured summary + implicit gate; no code

One model requires a clean re-run (Gemini 3.1 Pro — run 1 gate expired). Gate-grant rate so far: 4/4 DNF-CI-A (excl. Opus) + 1 DNF-CI-OA (Mini-XHigh run 3 — workspace modified; user stopped). Gate-granted → DNF-CI-A is now a 4/4 pattern. Clean-room API scores for all models are unaffected by this study.

Study 3 — April 2026

ACB PGI — PROGRESSIVE GOVERNANCE INJECTION

Updated: April 17, 2026 · N=11 models · L1 prompt: "Implement the following specification." + spec attached

PGI tests whether a single role-assignment signal is the minimum viable access pattern (MVA) for reliable implementation. Models receive the specification directly in the prompt, eliminating the context-discovery problem isolated in Studies 1 and 2.
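
As described, the L1 condition reduces to a single role-assignment line prepended to the spec text; a minimal sketch (the file path is an assumption, reusing seed.md from the protocol above):

```typescript
import { readFileSync } from "node:fs";

// L1 prompt: one role-assignment line plus the full spec inline,
// removing the context-discovery step isolated in Studies 1 and 2.
function buildL1Prompt(specPath = "seed.md"): string {
  const spec = readFileSync(specPath, "utf8");
  return `Implement the following specification.\n\n${spec}`;
}
```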

KEY FINDING — MVA = L1 (Universal)
10/11 models scored 130/130 S-grade at L1. GPT-4o scored 122/130 A-grade (T8 atomic rollback + T9 format). The delivery gap from Studies 1–2 is fully resolved by direct spec attachment. L1 is confirmed as the minimal viable access pattern across all tested providers.
Model | Score | Grade | Process | Note
Claude Haiku 4.5 | 130/130 | S | clean
Claude Sonnet 4.6 | 130/130 | S | clean
Claude Opus 4.6 High | 130/130 | S | clean
Gemini 3.1 Pro Preview | 130/130 | S | clean
Gemini 2.5 Pro | 130/130 | S | clean
Gemini 3 Flash Preview | 130/130 | S | gate-attempt | Requested tsc --noEmit before declaring done
GPT-4.1 | 130/130 | S | clean
GPT 5.4 XHigh | 130/130 | S | workspace-probe | Read package.json, tsconfig, harness before writing
GPT-5.3-Codex | 130/130 | S | self-validated | Ran harness 2× against own output before declaring done
Grok Code Fast 1 | 130/130 | S | gate-attempt | Requested tsc --noEmit before declaring done
GPT-4o | 122/130 | A | clean | T8 atomic rollback miss (8 pts) + T9 empty report format (2 pts)

Process classes: clean = implemented directly; workspace-probe = read workspace files before writing; self-validated = ran harness against own output before declaring done; gate-attempt = requested tsc before declaring done. Process class does not affect the score.

GRADING SCALE

Grade | Range | Meaning
S | 95–130 | Exceptional
A | 85–94 | Strong
B | 70–84 | Competent
C | 55–69 | Developing
D | 40–54 | Weak
F | 0–39 | Critical
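
Mapping a score to a grade is a threshold lookup. Note that GPT-4o's 122/130 earning an A only fits if the bands are read as percentages of the maximum (122/130 is roughly 94%), so this sketch normalizes by the point ceiling; treat that reading as an assumption:

```typescript
// Grade bands from the scale above, normalized to a percentage
// (assumption: 122/130 ≈ 94% lands in A, matching the PGI table).
function gradeFor(score: number, max = 130): "S" | "A" | "B" | "C" | "D" | "F" {
  const pct = (score / max) * 100;
  if (pct >= 95) return "S"; // Exceptional
  if (pct >= 85) return "A"; // Strong
  if (pct >= 70) return "B"; // Competent
  if (pct >= 55) return "C"; // Developing
  if (pct >= 40) return "D"; // Weak
  return "F";                // Critical
}
```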

Agent Cognitive Benchmark © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.
