00 · Evidence

Receipts, not claims.

Every eval we publish includes the full methodology, the verbatim agent prompts, the starter codebase, the raw agent stdout, and a unified diff of every line of code each run produced. You can re-run the bench yourself and verify the numbers. We’d rather lose a sale than fudge a claim.

01 · Eval · 2026-05-01

Same task, three briefings, head-to-head.

Task: add per-org rate limiting to POST /api/v1/packets in a Hono codebase. Agent: Claude Code 2.1.126, one-shot non-interactive. Variable: only the briefing the agent reads. Identical 50-line starter codebase in all three conditions.

Briefing	Words	Score	%	Wall-clock
Bare one-line prompt	19	2 / 6	33%	55s
Raw Slack thread (chat-style paste)	168	5 / 6	83%	1m 15s
Aideps-compiled Work Packet (PKT-W6EARP)	333	6 / 6	100%	2m 32s

Each condition is scored pass / fail on six binary acceptance criteria. The compiled-packet run wrote 4 unit tests, ran them (all passed), and refactored the server entry point so it could be tested. Neither of the other two conditions wrote tests.
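The arithmetic behind the Score and % columns can be sketched as follows. This is a minimal illustration, not the scoring script from the eval repository: the names `Scorecard`, `countPassed`, and `formatScore` are assumptions made for this example, and rounding to the nearest whole percent is inferred from the table (2 / 6 shown as 33%, 5 / 6 as 83%).

```typescript
// Hypothetical names throughout; the eval repo may structure its scoring differently.
type CriterionId = "C1" | "C2" | "C3" | "C4" | "C5" | "C6";
type Scorecard = Record<CriterionId, boolean>;

const CRITERIA: CriterionId[] = ["C1", "C2", "C3", "C4", "C5", "C6"];

// Count how many of the six binary criteria a run passed.
function countPassed(card: Scorecard): number {
  return CRITERIA.filter((id) => card[id]).length;
}

// Render a pass count as "passed / total (percent%)", rounded to a whole percent.
function formatScore(passed: number, total: number = CRITERIA.length): string {
  const percent = Math.round((passed / total) * 100);
  return `${passed} / ${total} (${percent}%)`;
}

// The compiled-packet run passed every criterion, so its scorecard is all true.
const packetRun: Scorecard = {
  C1: true, C2: true, C3: true, C4: true, C5: true, C6: true,
};

console.log(formatScore(countPassed(packetRun))); // 6 / 6 (100%)
console.log(formatScore(5)); // 5 / 6 (83%)
console.log(formatScore(2)); // 2 / 6 (33%)
```

Only the all-pass scorecard is spelled out, because the table gives totals for the other two conditions, not which criteria they passed.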

02 · Methodology

How the score is computed.

The six binary acceptance criteria are derived from the compiled packet itself, so all three conditions are scored against the same target rather than a goalpost moved to favor the packet. Pass / fail is determined by reading the unified diff each agent produced.

C1 · Hitting POST /api/v1/packets 61 times in 60s as one org returns HTTP 429 on the 61st request
C2 · Two different orgs each get independent buckets: exhausting org A does not affect org B
C3 · 429 response body is JSON matching { code: "rate_limited", retryAfter: number }
C4 · Retry-After header is also set on the 429 response
C5 · README.md gains a "Rate limits" section
C6 · New tests cover the criteria above
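To make C1 through C4 concrete, here is one way a passing implementation could look. It is a sketch under stated assumptions, not the code any of the three runs produced: it uses a fixed-window, in-memory counter, and the names `OrgRateLimiter`, `check`, and `rateLimitedReply` are hypothetical. It is written framework-free so it runs on its own. Wiring it into a Hono route, and deciding how the org ID is extracted from the request, is left out.

```typescript
// Illustrative only: fixed-window limiter keyed by org ID (names are assumptions).
type Decision =
  | { allowed: true }
  | { allowed: false; retryAfter: number }; // whole seconds until the window resets

class OrgRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  // One independent window per org, so exhausting org A never touches org B (C2).
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(limit: number = 60, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  check(orgId: string, now: number = Date.now()): Decision {
    const w = this.windows.get(orgId);
    // First request for this org, or the previous window has expired: start a new one.
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(orgId, { start: now, count: 1 });
      return { allowed: true };
    }
    if (w.count < this.limit) {
      w.count += 1;
      return { allowed: true };
    }
    // Over the limit: with limit = 60, this is the 61st request in the window (C1).
    const retryAfter = Math.ceil((w.start + this.windowMs - now) / 1000);
    return { allowed: false, retryAfter };
  }
}

// Status, headers, and JSON body for a rejected request (C3 and C4).
function rateLimitedReply(retryAfter: number) {
  return {
    status: 429,
    headers: { "Retry-After": String(retryAfter) },
    body: { code: "rate_limited" as const, retryAfter },
  };
}

// Usage: 60 requests pass, the 61st is rejected, and another org is unaffected.
const limiter = new OrgRateLimiter();
const t0 = 1000000; // arbitrary fixed clock so the example is deterministic
for (let i = 0; i < 60; i++) limiter.check("org-a", t0 + i);
const sixtyFirst = limiter.check("org-a", t0 + 100);
const otherOrg = limiter.check("org-b", t0 + 100);
console.log(sixtyFirst.allowed, otherOrg.allowed); // false true
```

Passing `now` explicitly keeps the limiter testable without real timers, which is the kind of test C6 asks for. A fixed window is the simplest scheme that satisfies C1 as worded. A sliding window or token bucket would also pass and smooths out bursts at window boundaries.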

03 · Reproduce

Every artifact is public.

Manifest, prompts, starter codebase, raw thread, compiled packet, agent stdout, unified diffs, and per-criterion scoring are all in one repository. Apply the patch files to the starter codebase, then re-run the prompts with your own Claude Code installation. You should reach the same numbers.

git clone https://github.com/aideps-devs/aideps-evals
cd aideps-evals/2026-05-01-rate-limit-pkt-w6earp
cat SUMMARY.md          # headline numbers
cat scoring.md          # per-criterion analysis
cat runs/c3-packet/diff.patch

Eval repo on GitHub → · CLI reference

04 · Honest caveats

What this eval does not prove.

Statistical claim. One task per condition. Treat the numbers as directional. We’re running more evals on different tasks before quoting a fixed % gain anywhere customer-facing.
Memory compounding. This is packet #1 in a fresh workspace, so Aideps’s memory loop did not influence the result — the 100% score came from structure alone (acceptance criteria, suggested files, decisions with rationale). The compounding gain only appears from packet #2 onwards.
Real production codebase. The bench is a 50-line Hono stub. The packet’s suggestedFiles field probably saves more time on real repos than on this one — but we haven’t measured it yet.
One agent. Claude Code only so far. We’ll re-run with Cursor, Codex, and a custom shell harness. The hypothesis is that the gap grows for less-capable agents — when the agent is weaker, structure helps more.

See it on your own repo.

Pilot a Work Packet on a real task in your codebase. We compile, your agent runs it, you keep all artifacts. If the gain isn’t obvious within the first three packets, we refund the pilot. No call required.

Start Pilot → · Eval repo