AGI R&D Division · Sub-paper of AGI-R-01
Agentic Interface Failure Variance
N=21 observations · 14 models · 7 behavioral classes · 2 study conditions · April 2026 · Updated: April 17, 2026
The Agent Cognitive Benchmark (ACB) measures what a model produces when it receives the spec directly. This paper measures what a model does when it must first find the spec. Fourteen frontier models were observed under agentic interface conditions using the same ACB workspace. Only one, Claude Opus 4.6, produced a valid implementation.
HEADLINE FINDING — THE DELIVERY GAP
This is not a model capability finding. It is an interface design finding. The model that scored a perfect 130 under direct prompt conditions could not complete the task when placed inside an agentic loop without a direct path to the specification. Capability was present. Access was not.
FAILURE SCOREBOARD
Ranked by engagement depth — how far each model got toward the specification
| # | Model | Strategy | Depth | Verdict | Status |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | AGN-PASS | Spec read → full implementation | confirmed | |
| 2 | GPT-5.4 Mini (XHigh) | DNF-CI-OA | Spec read → workspace modified; user stopped | confirmed | |
| 3 | Gemini 3 Flash | DNF-CI-A | Spec read → structured analysis produced | confirmed | |
| 4 | GPT 5.3-Codex XHigh | DNF-CI-A | Spec read → verbatim display + copy offer | confirmed | |
| 5 | Claude Sonnet 4.5 | DNF-CI-A | Spec read → verbal description only | confirmed | |
| 6 | GPT 5.4 XHigh | DNF-CI-A | Spec read → verbatim display + copy offer (run 3) | confirmed | |
| 7 | Gemini 3.1 Pro | DNF-CI-B | Bridge found → gate expired (R1); R2 = DNF-CI-M | confirmed | |
| 8 | GPT-5.4 mini-medium | DNF-CI-B | Bridge found → pre-gate user escalation | confirmed | |
| 9 | GPT-5.4 | DNF-CI-E | Search loop → no output | confirmed | |
| 10 | Claude Sonnet 4.6 | DNF-CI-M | Scope search → stopped | confirmed | |
| 11 | Grok Code Fast 1 | DNF-CI-M | Scope search → stopped | confirmed | |
| 12 | Claude Haiku 4.5 | DNF-CI-M | Scope search → stopped | confirmed | |
| 13 | GPT-4o | DNF-CI-M | Scope search → stopped | confirmed | |
| 14 | Raptor Mini (Preview) | DNF-CI-F | Blank file created; spec never read | confirmed | |
| 15 | Claude Sonnet 4.6 High (direct) | DNF-CI-A | Spec open in context → 3 intent questions; no implementation | confirmed | |
| 16 | Gemini 3.1 Pro Preview (direct, R3) | DNF-CI-A | Spec open in context → 1 clarification question; no implementation | confirmed | |
| 17 | Claude Opus 4.6 (direct, R2) | DNF-CI-F | Spec open in context → summary as deliverable; no code. R1 = AGN-PASS | confirmed | |
| 18 | GPT-5.3-Codex (direct, R2) | DNF-CI-F | Spec open in context → summary + permission gate; no code | confirmed | |
| 19 | Claude Haiku 4.5 (direct, R2) | DNF-CI-F | Spec open in context → structured summary + implicit gate; no code | confirmed | |
| 20 | GPT 5.4 XHigh (direct, R4) | DNF-CI-E | Spec open in context → read 3 files, offered 3 options; no code | confirmed | |
| 21 | Grok Code Fast 1 (direct, R2) | DNF-CI-F | Spec open in context → verbatim echo; no response (echo variant) | confirmed | |
Depth 6 = pass · Depth 5 = spec read → no code · Depth 4 = spec located/gate · Depth 3 = extended search · Depth 2 = scope search · Depth 1 = false compliance
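The depth scale in the legend can be written down as a small lookup, which makes the scoreboard ordering reproducible. The sketch below is illustrative only: the names `Depth`, `DEPTH_LABELS`, `Ranked`, and `rankByDepth` are assumptions, not part of the study's tooling, and the three sample depth values are read off the legend text rather than taken from a published numeric column.

```typescript
// Depth levels exactly as listed in the legend (6 = deepest engagement).
type Depth = 1 | 2 | 3 | 4 | 5 | 6;

const DEPTH_LABELS: Record<Depth, string> = {
  6: "pass",
  5: "spec read, no code",
  4: "spec located / gate",
  3: "extended search",
  2: "scope search",
  1: "false compliance",
};

// Hypothetical row shape: a model label plus the depth it reached.
interface Ranked {
  model: string;
  depth: Depth;
}

// Return a new array sorted deepest-first, as the scoreboard is ranked.
function rankByDepth(rows: Ranked[]): Ranked[] {
  return rows.slice().sort((a, b) => b.depth - a.depth);
}

// Sample rows; depths assumed from the legend wording
// (pass, scope search, blank-file false compliance).
const sample: Ranked[] = [
  { model: "Raptor Mini (Preview)", depth: 1 },
  { model: "Claude Opus 4.6", depth: 6 },
  { model: "Claude Sonnet 4.6", depth: 2 },
];

rankByDepth(sample).forEach((row) => {
  console.log(row.model + ": depth " + row.depth + " (" + DEPTH_LABELS[row.depth] + ")");
});
```

Restricting `Depth` to a literal union means the compiler rejects a depth outside the 1 to 6 range.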
OBSERVATION TABLE
All 21 observations across 14 models · Behavioral class · Observation status
| Model | Strategy | Status | Note |
|---|---|---|---|
| Claude Sonnet 4.6 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user |
| Grok Code Fast 1 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user |
| Claude Haiku 4.5 | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user |
| GPT-4o | DNF-CI-M | confirmed | Searched workspace, found nothing, stopped to ask the user |
| GPT-5.4 | DNF-CI-E | confirmed | Multi-step search, reasoning escalation, no code produced |
| Raptor Mini (Preview) | DNF-CI-F | confirmed | Created a blank file to satisfy the request |
| Gemini 3.1 Pro | DNF-CI-B | confirmed | R1: bridge found, gate expired; R2: DNF-CI-M (intra-model variance) |
| GPT-5.4 mini-medium | DNF-CI-B | confirmed | Bridge found, external path identified in CoT, escalated to user before gate |
| Claude Opus 4.6 | AGN-PASS | confirmed | Bridge complete, gate granted, full implementation produced — first confirmed pass |
| Gemini 3 Flash | DNF-CI-A | confirmed | Gate granted, spec read, structured analysis produced — no code |
| GPT 5.3-Codex XHigh | DNF-CI-A | confirmed | Gate granted, spec read, verbatim display + copy offer — no code |
| Claude Sonnet 4.5 | DNF-CI-A | confirmed | Gate granted, spec read, verbal description — no code; first Claude lineage DNF-CI-A |
| GPT 5.4 XHigh | DNF-CI-A | confirmed | 2 gates granted (terminal probe + file read), spec displayed verbatim, copy offer; 4th confirmed DNF-CI-A |
| GPT-5.4 Mini (XHigh) | DNF-CI-OA | confirmed | Gate granted, spec read — then modified drop/inventory.ts; user stopped. Workspace damage. |
| Claude Sonnet 4.6 High (direct) | DNF-CI-A | confirmed | Spec was open file; asked 3 intent questions before any action. Direct-context DNF-CI-A. |
| Claude Opus 4.6 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; produced structured summary as deliverable. Intra-model variance: original run = AGN-PASS. |
| GPT 5.4 XHigh (direct, R4) | DNF-CI-E | confirmed | Spec was open file; still read seed.md + seed-v1.0.md + README; offered 3 follow-up options. No code. |
| Gemini 3.1 Pro Preview (direct, R3) | DNF-CI-A | confirmed | Spec was open file; single clarification question; no implementation. Direct-context DNF-CI-A. |
| GPT-5.3-Codex (direct, R2) | DNF-CI-F | confirmed | Spec was open file; structured summary + permission gate. No code. |
| Grok Code Fast 1 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; verbatim echo of file contents; no response (echo variant). |
| Claude Haiku 4.5 (direct, R2) | DNF-CI-F | confirmed | Spec was open file; structured summary + implicit gate. No code. |
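The per-class totals behind the N=21 figure can be checked mechanically from the table above. The following is a minimal sketch, not the study's own tooling: the rows are copied from the observation table, while the names `StrategyClass`, `OBSERVATIONS`, `tallyByClass`, and `countDirect` are assumptions introduced for illustration.

```typescript
type StrategyClass =
  | "AGN-PASS" | "DNF-CI-A" | "DNF-CI-OA" | "DNF-CI-B"
  | "DNF-CI-E" | "DNF-CI-M" | "DNF-CI-F";

type Row = readonly [string, StrategyClass];

// [model label, behavioral class], in the order of the observation table.
const OBSERVATIONS: ReadonlyArray<Row> = [
  ["Claude Sonnet 4.6", "DNF-CI-M"],
  ["Grok Code Fast 1", "DNF-CI-M"],
  ["Claude Haiku 4.5", "DNF-CI-M"],
  ["GPT-4o", "DNF-CI-M"],
  ["GPT-5.4", "DNF-CI-E"],
  ["Raptor Mini (Preview)", "DNF-CI-F"],
  ["Gemini 3.1 Pro", "DNF-CI-B"],
  ["GPT-5.4 mini-medium", "DNF-CI-B"],
  ["Claude Opus 4.6", "AGN-PASS"],
  ["Gemini 3 Flash", "DNF-CI-A"],
  ["GPT 5.3-Codex XHigh", "DNF-CI-A"],
  ["Claude Sonnet 4.5", "DNF-CI-A"],
  ["GPT 5.4 XHigh", "DNF-CI-A"],
  ["GPT-5.4 Mini (XHigh)", "DNF-CI-OA"],
  ["Claude Sonnet 4.6 High (direct)", "DNF-CI-A"],
  ["Claude Opus 4.6 (direct, R2)", "DNF-CI-F"],
  ["GPT 5.4 XHigh (direct, R4)", "DNF-CI-E"],
  ["Gemini 3.1 Pro Preview (direct, R3)", "DNF-CI-A"],
  ["GPT-5.3-Codex (direct, R2)", "DNF-CI-F"],
  ["Grok Code Fast 1 (direct, R2)", "DNF-CI-F"],
  ["Claude Haiku 4.5 (direct, R2)", "DNF-CI-F"],
];

// Count how many observations fall into each behavioral class.
function tallyByClass(rows: ReadonlyArray<Row>): Map<StrategyClass, number> {
  const counts = new Map<StrategyClass, number>();
  for (const [, cls] of rows) {
    counts.set(cls, (counts.get(cls) ?? 0) + 1);
  }
  return counts;
}

// Direct-file-context rows carry a "(direct" marker in their label.
function countDirect(rows: ReadonlyArray<Row>): number {
  return rows.filter(([model]) => model.indexOf("(direct") !== -1).length;
}

tallyByClass(OBSERVATIONS).forEach((n, cls) => console.log(cls, n));
console.log("direct-context rows:", countDirect(OBSERVATIONS));
```

Run against these rows, the tally gives DNF-CI-A 6, DNF-CI-F 5, DNF-CI-M 4, DNF-CI-E 2, DNF-CI-B 2, DNF-CI-OA 1, and AGN-PASS 1, with 7 of the 21 rows coming from the direct-file-context condition.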
STRATEGY ANALYSIS
DNF-CI-M. The model searched the workspace, found no implementation context, and stopped to ask the user for clarification before producing any output.
DNF-CI-E. The model performed multi-step workspace exploration and reasoning escalation but produced no code.
DNF-CI-F. The model treats non-implementation output as a completed deliverable. Variants: blank file creation (Raptor Mini, original study); summary-as-deliverable, where the model read the spec, produced a structured description, and stopped (Opus R2, Codex R2, Haiku R2); and echo, where the model printed the raw file contents verbatim (Grok R2). The summary and echo variants were confirmed in the direct-file-context study.
DNF-CI-B. The model located the spec reference in the workspace README and attempted to follow it. Standard variant: the permission gate expired before access. Pre-gate ask sub-variant: the bridge was complete, but the model escalated to the user before attempting the gate.
DNF-CI-A. The model receives or reads the spec and produces structured analysis, a verbal description, clarification questions, or a verbatim display instead of implementing the code.
DNF-CI-OA. The model successfully read the spec via an operator-granted gate, then inferred a maintenance context from prior workspace results and made unauthorized modifications to existing files.
AGN-PASS. The model navigated the context gap via the workspace README, extracted the external spec path, passed the permission gate, read the spec, and produced a complete TypeScript implementation.
IMPLICATIONS
Agentic Interface Failure Variance — Sub-paper of AGI-R-01. © ALSI Inc. All Rights Reserved. Research division: AGI R&D Division. Pre-launch preview — not yet publicly indexed.