AGI R&D Division · AGI-R-01

Agent Cognitive Benchmark

v1.1 · 53 tests · 130 points · 13 cognitive trap categories

A standardised test for measuring AI agent code-generation competence. Every model receives the same prompt, the same constraints, and the same specification — scored against a deterministic answer key across 13 categories of known AI failure patterns.

HOW IT WORKS

PROMPT (seed.md) → AGENT (approved model) → OUTPUT (TypeScript module) → SCORE (0–130)

The agent receives seed.md — a self-contained natural language specification
The agent produces a TypeScript module in a single pass — no hints, no clarifications
The harness runs 53 deterministic test cases against the output
Score is recorded with wall-clock time on the scoreboard

13 COGNITIVE TRAP CATEGORIES

#	Category	What It Tests	Pts
T1	Merge Semantics	Additive vs. overwrite — compound merge rules under conflicting inputs	12
T2	Return Value Precision	Returning actual vs. requested values when constraints apply	8
T3	Floating Point Discipline	0.1 + 0.2 ≠ 0.3 — proper rounding and accumulation handling	10
T4	Conditional Replacement	"Replace only if greater" — short-circuit logic correctness	8
T5	Inclusive Boundaries	min ≤ price ≤ max — off-by-one on range queries	8
T6	Deep Clone Integrity	True deep copy of nested structures — not shallow spread	10
T7	Immutable Sort	sortByValue() must not mutate the original array	10
T8	Atomic Batch Semantics	All-or-nothing updates — partial failure requires full rollback	12
T9	Report Formatting	String output matches exact expected format	10
T10	Edge Cases	Empty state, zero quantity, negative discount, duplicate entries	12
T11	Async Concurrency	Promise.allSettled semantics — null/rejection skipping; merge delegation	12
T12	Generic Accessor	Typed field retrieval with correct undefined-on-miss behaviour	8
T13	Partition Semantics	Predicate split into two independent deep-copy instances	10
Total			130

SCOREBOARD — v1.0 RESULTS

100 points · 10 categories (T1–T10) · Single-shot protocol

#	Model	Score	Grade
1	Claude 4 Opus	100/100	S
2	Gemini 3 Flash (preview)	100/100	S
3	Claude Sonnet 4	DNF	—
4	Gemini 2.5	DNF	—

DNF — Did Not Finish. Models that requested clarification before generating output are disqualified under protocol rules.

SCOREBOARD — v1.1

130 points · 13 categories (T1–T13) · Clean-room protocol · Live updates

Loading live data…

New Track — April 2026

ACB-AGENTIC — CONTEXT INVERSION STUDY

Updated: April 17, 2026

A parallel study ran the ACB prompt inside a VS Code agentic interface rather than via clean-room API. The results revealed a categorical difference between the two delivery surfaces.

KEY FINDING — Agentic Overhead = 130 Points
Claude Sonnet 4.6: 130/130 S-grade via clean-room API — DNF via agentic interface. Same model, same date, same workspace. The delivery method alone determined the outcome.
Clean-room API:  130/130  S — Exceptional
Agentic VS Code: DNF      (never produced code)
Delta:           130 pts

6 BEHAVIORAL CLASSES OBSERVED — 14 MODELS

DNF-CI-E2 models
Context Inversion (elaborate)
Multi-step workspace exploration, reasoning escalation, permission negotiation. Never produces code. Confirmed across two study conditions: GPT 5.4 (original, ~5–6 search steps); GPT 5.4 XHigh (direct-file-context, read 3 files + offered 3 follow-up options).

DNF-CI-M4 models
Immediate stop (4 models)
Single workspace search, found nothing, clean clarification request to user. Fast stop. Claude Sonnet 4.6, Grok Code Fast 1, Claude Haiku 4.5, GPT 4o.

DNF-CI-F5 models
Context Inversion (fabrication)
Model treats non-implementation output as a completed deliverable. Variants: (1) blank file creation — created empty file to satisfy surface request without reading spec (Raptor Mini); (2) summary-as-deliverable — read spec, produced structured description, stopped (Opus R2, Codex R2, Haiku R2); (3) echo — printed raw file contents verbatim, no response (Grok R2). Confirmed across 5 models in 2 study conditions.

AGN-PASS1 model
Agentic Pass (1 confirmed)
Navigated context gap via workspace README, extracted external seed path, passed permission gate, read spec, produced full implementation. Claude Opus 4.6 — first confirmed. All 13 methods correct including T11–T13.

DNF-CI-B3 models
Bridge attempt (2 pending re-run)
Used workspace README as a map to locate the external spec. Standard variant: attempted permission gate (expired — Gemini 3.1 Pro run 1, GPT 5.4 XHigh). Pre-gate ask sub-variant: bridge complete, external path identified in CoT, model escalated to user before any gate attempt — GPT-5.4 mini-medium (first confirmed). Gemini 3.1 Pro showed intra-model variance: run 1 = bridge (gate expired), run 2 = DNF-CI-M.

DNF-CI-OA1 model
Over-Automation (1 confirmed)
Bridge fully succeeded, spec read via operator-granted gate — model then inferred a maintenance context from prior workspace results. Made unauthorized workspace modifications: created local seed copy, read prior result files, modified existing implementation (partition() patch), deleted temp file. No fresh submission produced. User stopped the session. Workspace damage. GPT-5.4 Mini-XHigh run 3 — first confirmed.

DNF-CI-A6 models
Analysis Capture (6 confirmed)
Model receives spec and produces analysis, description, or clarification questions instead of code. Original study (gate-required): confirmed across Gemini 3 Flash, GPT 5.3-Codex XHigh, Claude Sonnet 4.5, GPT 5.4 XHigh — 4/4 gate-granted → DNF-CI-A (excl. Opus). Direct-file-context study: confirmed across Claude Sonnet 4.6 High (3 intent questions) and Gemini 3.1 Pro Preview (1 clarification question). Gate access and direct spec access both fail to trigger implementation.

ALL 14 MODELS

Model	Class	Steps	Outcome
GPT 5.4	DNF-CI-E	~5–6	DNF
Claude Sonnet 4.6	DNF-CI-M	1	DNF
Grok Code Fast 1	DNF-CI-M	1	DNF
Claude Haiku 4.5	DNF-CI-M	1	DNF
GPT 4o	DNF-CI-M	1	DNF
Raptor Mini (Preview)	DNF-CI-F	1	DNF
Claude Opus 4.6	AGN-PASS	4+	AGN-PASS — first confirmed
Gemini 3 Flash (Preview)	DNF-CI-A	3+	DNF-CI-A
GPT 5.3-Codex XHigh	DNF-CI-A	~6	DNF-CI-A
Gemini 3.1 Pro (Preview)	Bridge attempt	R1: 3 / R2: 1	R1: Gate expired — re-run / R2: DNF-CI-M (intra-model variance)
Claude Sonnet 4.5	DNF-CI-A	4+	DNF-CI-A — third confirmed; first Claude lineage
GPT-5.4 mini-medium	DNF-CI-B (pre-gate ask)	~3	DNF-CI-B — pre-gate ask; bridge complete, user-escalation before gate
GPT-5.4 Mini (XHigh)	DNF-CI-OA	5+ / 1 / 10+	Runs 1+2: gate expired. Run 3: DNF-CI-OA — modified drop/inventory.ts; user stopped
GPT 5.4 XHigh	DNF-CI-A (run 3)	~8	DNF-CI-A — 2 gates granted (terminal probe + file read); full spec displayed + file copy offer. 4th confirmed.
Claude Sonnet 4.6 High	DNF-CI-A	0	DNF-CI-A — spec was open file; 3 intent questions, no implementation attempt
Claude Opus 4.6 (direct)	DNF-CI-F	1	DNF-CI-F — spec open in context; summary as deliverable. Intra-model variance: run 1 = AGN-PASS
GPT 5.4 XHigh (direct)	DNF-CI-E	~3	DNF-CI-E — read seed.md + seed-v1.0.md + README; offered 3 follow-up options; no code
Gemini 3.1 Pro Preview (direct)	DNF-CI-A	0	DNF-CI-A — spec was open file; single clarification question; no implementation
GPT-5.3-Codex (direct)	DNF-CI-F	1	DNF-CI-F — spec open in context; structured summary + permission gate; no code
Grok Code Fast 1 (direct)	DNF-CI-F	0	DNF-CI-F — verbatim file echo; no response (echo variant)
Claude Haiku 4.5 (direct)	DNF-CI-F	1	DNF-CI-F — spec open in context; structured summary + implicit gate; no code

One model requires a clean re-run (Gemini 3.1 Pro — run 1 gate expired). Gate-grant rate so far: 4/4 DNF-CI-A (excl. Opus) + 1 DNF-CI-OA (Mini-XHigh run 3 — workspace modified; user stopped). Gate-granted → DNF-CI-A is now a 4/4 pattern. Clean-room API scores for all models are unaffected by this study.

Study 3 — April 2026

ACB PGI — PROGRESSIVE GOVERNANCE INJECTION

Updated: April 17, 2026 · N=11 models · L1 prompt: "Implement the following specification." + spec attached

PGI tests whether a single role-assignment signal is the minimum viable access pattern (MVA) for reliable implementation. Models receive the specification directly in the prompt, eliminating the context-discovery problem isolated in Studies 1 and 2.

KEY FINDING — MVA = L1 (Universal)
10/11 models scored 130/130 S-grade at L1. GPT-4o scored 122/130 A-grade (T8 atomic rollback + T9 format). The delivery gap from Studies 1–2 is fully resolved by direct spec attachment. L1 is confirmed as the minimal viable access pattern across all tested providers.

Model	Score	Grade	Process	Note
Claude Haiku 4.5	130/130	S	clean	—
Claude Sonnet 4.6	130/130	S	clean	—
Claude Opus 4.6 High	130/130	S	clean	—
Gemini 3.1 Pro Preview	130/130	S	clean	—
Gemini 2.5 Pro	130/130	S	clean	—
Gemini 3 Flash Preview	130/130	S	gate-attempt	Requested tsc --noEmit before declaring done
GPT-4.1	130/130	S	clean	—
GPT 5.4 XHigh	130/130	S	workspace-probe	Read package.json, tsconfig, harness before writing
GPT-5.3-Codex	130/130	S	self-validated	Ran harness 2× against own output before declaring done
Grok Code Fast 1	130/130	S	gate-attempt	Requested tsc --noEmit before declaring done
GPT-4o	122/130	A	clean	T8 atomic rollback miss (8pts) + T9 empty report format (2pts)

Process classes: clean = implemented directly; workspace-probe = read workspace files before writing; self-validated = ran harness against own output before declaring done; gate-attempt = requested tsc before declaring done. Process class does not affect the score.

GRADING SCALE

S95–130Exceptional

A85–94Strong

B70–84Competent

C55–69Developing

D40–54Weak

F0–39Critical