AGI R&D Division · AGI-R-01
Agent Cognitive Benchmark
v1.1 · 53 tests · 130 points · 13 cognitive trap categories
A standardised test for measuring AI agent code-generation competence. Every model receives the same prompt, the same constraints, and the same specification — scored against a deterministic answer key across 13 categories of known AI failure patterns.
HOW IT WORKS
- The agent receives seed.md — a self-contained natural language specification
- The agent produces a TypeScript module in a single pass — no hints, no clarifications
- The harness runs 53 deterministic test cases against the output
- Score is recorded with wall-clock time on the scoreboard
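The scoring loop above can be sketched as follows. This is a minimal illustration, not the actual harness: the names `TestCase` and `runSuite` are ours, and the real harness's API is not published in this document.

```typescript
// Hypothetical sketch of a deterministic scoring harness.
// TestCase and runSuite are illustrative names, not the real harness API.
type TestCase = {
  id: string;          // e.g. "T3-01"
  category: string;    // trap category, e.g. "T3"
  points: number;
  run: () => boolean;  // deterministic pass/fail check against the answer key
};

function runSuite(cases: TestCase[]): { score: number; total: number } {
  let score = 0;
  let total = 0;
  for (const c of cases) {
    total += c.points;
    if (c.run()) score += c.points; // full points only on an exact-match pass
  }
  return { score, total };
}

// Example: one deterministic check in the T3 (floating point) category
const sample: TestCase = {
  id: "T3-01",
  category: "T3",
  points: 2,
  run: () => Math.round((0.1 + 0.2) * 100) / 100 === 0.3,
};

console.log(runSuite([sample])); // { score: 2, total: 2 }
```

Because every check is a pure boolean function against fixed inputs, the same submission always yields the same score, which is what makes cross-model comparison valid.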
13 COGNITIVE TRAP CATEGORIES
| # | Category | What It Tests | Pts |
|---|---|---|---|
| T1 | Merge Semantics | Additive vs. overwrite — compound merge rules under conflicting inputs | 12 |
| T2 | Return Value Precision | Returning actual vs. requested values when constraints apply | 8 |
| T3 | Floating Point Discipline | 0.1 + 0.2 ≠ 0.3 — proper rounding and accumulation handling | 10 |
| T4 | Conditional Replacement | "Replace only if greater" — short-circuit logic correctness | 8 |
| T5 | Inclusive Boundaries | min ≤ price ≤ max — off-by-one on range queries | 8 |
| T6 | Deep Clone Integrity | True deep copy of nested structures — not shallow spread | 10 |
| T7 | Immutable Sort | sortByValue() must not mutate the original array | 10 |
| T8 | Atomic Batch Semantics | All-or-nothing updates — partial failure requires full rollback | 12 |
| T9 | Report Formatting | String output matches exact expected format | 10 |
| T10 | Edge Cases | Empty state, zero quantity, negative discount, duplicate entries | 12 |
| T11 | Async Concurrency | Promise.allSettled semantics — null/rejection skipping; merge delegation | 12 |
| T12 | Generic Accessor | Typed field retrieval with correct undefined-on-miss behaviour | 8 |
| T13 | Partition Semantics | Predicate split into two independent deep-copy instances | 10 |
| | **Total** | | **130** |
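Two of the categories above can be illustrated concretely. The helper names below (`cents`, `sortByValue`'s key parameter) are ours for illustration; the benchmark's own specification defines the actual signatures.

```typescript
// T3 — Floating Point Discipline: naive binary-float addition drifts.
const naive = 0.1 + 0.2; // 0.30000000000000004, not 0.3
// One common fix: accumulate in integer cents, divide once at the end.
const cents = (a: number, b: number): number =>
  (Math.round(a * 100) + Math.round(b * 100)) / 100;
// cents(0.1, 0.2) === 0.3

// T7 — Immutable Sort: Array.prototype.sort mutates in place,
// so sort a copy and leave the caller's array untouched.
function sortByValue<T>(items: T[], value: (t: T) => number): T[] {
  return [...items].sort((a, b) => value(a) - value(b)); // copy, then sort
}

const original = [{ v: 3 }, { v: 1 }, { v: 2 }];
const sorted = sortByValue(original, (x) => x.v);
// original stays [3, 1, 2]; sorted is [1, 2, 3]
```

Both traps share a theme: the naive one-liner (`prices.reduce((a, b) => a + b)`, `items.sort(...)`) compiles, runs, and silently fails the deterministic answer key.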
SCOREBOARD — v1.0 RESULTS
100 points · 10 categories (T1–T10) · Single-shot protocol
| # | Model | Score | Grade |
|---|---|---|---|
| 1 | Claude 4 Opus | 100/100 | S |
| 2 | Gemini 3 Flash (preview) | 100/100 | S |
| 3 | Claude Sonnet 4 | DNF | — |
| 4 | Gemini 2.5 | DNF | — |
DNF — Did Not Finish. Models that requested clarification before generating output are disqualified under protocol rules.
SCOREBOARD — v1.1
130 points · 13 categories (T1–T13) · Clean-room protocol · Live updates
New Track — April 2026
ACB-AGENTIC — CONTEXT INVERSION STUDY
Updated: April 17, 2026
A parallel study ran the ACB prompt inside a VS Code agentic interface rather than via clean-room API. The results revealed a categorical difference between the two delivery surfaces.
- Agentic VS Code: DNF (never produced code)
- Delta: 130 pts
6 BEHAVIORAL CLASSES OBSERVED — 14 MODELS
| Model | Class | Steps | Outcome |
|---|---|---|---|
| GPT 5.4 | DNF-CI-E | ~5–6 | DNF |
| Claude Sonnet 4.6 | DNF-CI-M | 1 | DNF |
| Grok Code Fast 1 | DNF-CI-M | 1 | DNF |
| Claude Haiku 4.5 | DNF-CI-M | 1 | DNF |
| GPT 4o | DNF-CI-M | 1 | DNF |
| Raptor Mini (Preview) | DNF-CI-F | 1 | DNF |
| Claude Opus 4.6 | AGN-PASS | 4+ | AGN-PASS — first confirmed |
| Gemini 3 Flash (Preview) | DNF-CI-A | 3+ | DNF-CI-A |
| GPT 5.3-Codex XHigh | DNF-CI-A | ~6 | DNF-CI-A |
| Gemini 3.1 Pro (Preview) | Bridge attempt | R1: 3 / R2: 1 | R1: Gate expired — re-run / R2: DNF-CI-M (intra-model variance) |
| Claude Sonnet 4.5 | DNF-CI-A | 4+ | DNF-CI-A — third confirmed; first Claude lineage |
| GPT-5.4 mini-medium | DNF-CI-B (pre-gate ask) | ~3 | DNF-CI-B — pre-gate ask; bridge complete, user-escalation before gate |
| GPT-5.4 Mini (XHigh) | DNF-CI-OA | 5+ / 1 / 10+ | Runs 1+2: gate expired. Run 3: DNF-CI-OA — modified drop/inventory.ts; user stopped |
| GPT 5.4 XHigh | DNF-CI-A (run 3) | ~8 | DNF-CI-A — 2 gates granted (terminal probe + file read); full spec displayed + file copy offer. 4th confirmed. |
| Claude Sonnet 4.6 High | DNF-CI-A | 0 | DNF-CI-A — spec was open file; 3 intent questions, no implementation attempt |
| Claude Opus 4.6 (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; summary as deliverable. Intra-model variance: run 1 = AGN-PASS |
| GPT 5.4 XHigh (direct) | DNF-CI-E | ~3 | DNF-CI-E — read seed.md + seed-v1.0.md + README; offered 3 follow-up options; no code |
| Gemini 3.1 Pro Preview (direct) | DNF-CI-A | 0 | DNF-CI-A — spec was open file; single clarification question; no implementation |
| GPT-5.3-Codex (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; structured summary + permission gate; no code |
| Grok Code Fast 1 (direct) | DNF-CI-F | 0 | DNF-CI-F — verbatim file echo; no response (echo variant) |
| Claude Haiku 4.5 (direct) | DNF-CI-F | 1 | DNF-CI-F — spec open in context; structured summary + implicit gate; no code |
One model requires a clean re-run (Gemini 3.1 Pro — run 1 gate expired). Gate-grant outcomes so far: 4/4 gate-granted runs ended in DNF-CI-A (excluding Opus), plus 1 DNF-CI-OA (Mini-XHigh run 3 — workspace modified; user stopped), making gate-granted → DNF-CI-A a consistent pattern. Clean-room API scores for all models are unaffected by this study.
Study 3 — April 2026
ACB PGI — PROGRESSIVE GOVERNANCE INJECTION
Updated: April 17, 2026 · N=11 models · L1 prompt: "Implement the following specification." + spec attached
PGI tests whether a single role-assignment signal is the minimum viable access pattern (MVA) for reliable implementation. Models receive the specification directly in the prompt, eliminating the context-discovery problem isolated in Studies 1 and 2.
| Model | Score | Grade | Process | Note |
|---|---|---|---|---|
| Claude Haiku 4.5 | 130/130 | S | clean | — |
| Claude Sonnet 4.6 | 130/130 | S | clean | — |
| Claude Opus 4.6 High | 130/130 | S | clean | — |
| Gemini 3.1 Pro Preview | 130/130 | S | clean | — |
| Gemini 2.5 Pro | 130/130 | S | clean | — |
| Gemini 3 Flash Preview | 130/130 | S | gate-attempt | Requested tsc --noEmit before declaring done |
| GPT-4.1 | 130/130 | S | clean | — |
| GPT 5.4 XHigh | 130/130 | S | workspace-probe | Read package.json, tsconfig, harness before writing |
| GPT-5.3-Codex | 130/130 | S | self-validated | Ran harness 2× against own output before declaring done |
| Grok Code Fast 1 | 130/130 | S | gate-attempt | Requested tsc --noEmit before declaring done |
| GPT-4o | 122/130 | A | clean | T8 atomic rollback miss (8 pts) + T9 empty report format (2 pts) |
Process classes: clean = implemented directly; workspace-probe = read workspace files before writing; self-validated = ran harness against own output before declaring done; gate-attempt = requested tsc before declaring done. Process class does not affect the score.
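The single deduction in the table above was the T8 atomic-rollback trap. A minimal sketch of what T8 demands, assuming a map-based inventory (the type and function names `Inventory`, `Update`, and `applyBatch` are illustrative, not from the benchmark spec):

```typescript
// T8 — Atomic Batch Semantics (illustrative sketch): apply a batch of
// updates all-or-nothing. Any invalid update must roll back the whole batch.
type Inventory = Map<string, number>;
type Update = { sku: string; delta: number };

function applyBatch(inv: Inventory, updates: Update[]): boolean {
  const snapshot = new Map(inv); // copy the state before mutating
  for (const u of updates) {
    const next = (inv.get(u.sku) ?? 0) + u.delta;
    if (next < 0) {
      // Partial failure: restore the snapshot and report failure.
      inv.clear();
      for (const [k, v] of snapshot) inv.set(k, v);
      return false;
    }
    inv.set(u.sku, next);
  }
  return true; // every update applied
}
```

The common miss is stopping at the failing update and leaving the earlier updates applied; the category awards points only when the pre-batch state is fully restored.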
GRADING SCALE
Agent Cognitive Benchmark © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.