00 · Evidence

Receipts, not claims.

Every eval we publish includes the full methodology, the verbatim agent prompts, the starter codebase, the raw agent stdout, and a unified diff of every line of code each run produced. You can re-run the bench yourself and verify the numbers. We’d rather lose a sale than fudge a claim.

01 · Eval · 2026-05-01

Same task, three briefings, head-to-head.

Task: add per-org rate limiting to POST /api/v1/packets in a Hono codebase. Agent: Claude Code 2.1.126, one-shot non-interactive. Variable: only the briefing the agent reads. Identical 50-line starter codebase in all three conditions.

Briefing                                  Words  Score  %     Wall-clock
Bare one-line prompt                      19     2 / 6  33%   55s
Raw Slack thread (chat-style paste)       168    5 / 6  83%   1m 15s
Aideps-compiled Work Packet (PKT-W6EARP)  333    6 / 6  100%  2m 32s

Six binary acceptance criteria, each scored pass / fail per condition. The compiled-packet run wrote 4 unit tests, ran them (all passed), and refactored the server entry to make it testable. Neither of the other two conditions wrote tests.
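The percentages in the table follow directly from the pass counts: passed criteria divided by six, rounded to the nearest whole percent. A minimal sketch of that arithmetic is below. The `Scorecard` type and `score` function are hypothetical names used for illustration, not part of the eval repo, and the example scorecard is made up rather than taken from the per-criterion results.

```typescript
// Hypothetical helper: one boolean per acceptance criterion (C1..C6).
type CriterionId = "C1" | "C2" | "C3" | "C4" | "C5" | "C6";
type Scorecard = Record<CriterionId, boolean>;

const CRITERIA: CriterionId[] = ["C1", "C2", "C3", "C4", "C5", "C6"];

function score(card: Scorecard): { passed: number; total: number; percent: number } {
  let passed = 0;
  for (const id of CRITERIA) {
    if (card[id]) passed += 1;
  }
  const total = CRITERIA.length;
  // Round to the nearest whole percent, e.g. 2 / 6 -> 33, 5 / 6 -> 83.
  return { passed, total, percent: Math.round((passed / total) * 100) };
}

// Illustrative scorecard only: which criteria passed here is invented.
const example: Scorecard = { C1: true, C2: true, C3: false, C4: false, C5: false, C6: false };
const exampleScore = score(example); // { passed: 2, total: 6, percent: 33 }
```

Because every criterion is binary and equally weighted, the only possible scores are multiples of one sixth.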

02 · Methodology

How the score is computed.

Six binary acceptance criteria, derived from the compiled packet itself so all three conditions are scored against the same target — not a goalpost moved to favor the packet. Pass / fail is determined by reading the unified diff each agent produced.

  1. C1: Hitting POST /api/v1/packets 61 times in 60s as one org returns HTTP 429 on the 61st request
  2. C2: Two different orgs each get independent buckets, so exhausting org A does not affect org B
  3. C3: 429 response body is JSON matching { code: "rate_limited", retryAfter: number }
  4. C4: Retry-After header is also set on the 429 response
  5. C5: README.md gains a "Rate limits" section
  6. C6: New tests cover the criteria above
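To make C1 through C4 concrete, here is a minimal sketch of limiter logic that would satisfy them. It is not the code any of the runs produced. It assumes a fixed 60-second window with a limit of 60 requests per org, which is one reading of C1 among several, and it leaves out the Hono wiring entirely. `OrgRateLimiter`, `Decision`, and `check` are hypothetical names.

```typescript
// Assumption: fixed-window counting, 60 requests per org per 60 seconds.
type Decision =
  | { status: 200 }
  | {
      status: 429;
      headers: { "Retry-After": string };
      body: { code: "rate_limited"; retryAfter: number };
    };

type Bucket = { windowStart: number; count: number };

class OrgRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  // One bucket per org id, so orgs never share a counter (C2).
  private readonly buckets = new Map<string, Bucket>();

  constructor(limit: number = 60, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // nowMs is passed in so tests can control the clock.
  check(orgId: string, nowMs: number): Decision {
    const bucket = this.buckets.get(orgId);
    if (!bucket || nowMs - bucket.windowStart >= this.windowMs) {
      // First request for this org, or the previous window has expired.
      this.buckets.set(orgId, { windowStart: nowMs, count: 1 });
      return { status: 200 };
    }
    if (bucket.count < this.limit) {
      bucket.count += 1;
      return { status: 200 };
    }
    // Over the limit (C1): report whole seconds until the window resets.
    const retryAfter = Math.ceil((bucket.windowStart + this.windowMs - nowMs) / 1000);
    return {
      status: 429,
      headers: { "Retry-After": String(retryAfter) }, // C4
      body: { code: "rate_limited", retryAfter }, // C3
    };
  }
}
```

Injecting the clock keeps the 61-requests-in-60-seconds case testable without sleeping, which is the kind of test C6 asks for. A sliding window or token bucket would also meet the criteria as written.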

03 · Reproduce

Every artifact is public.

Manifest, prompts, starter codebase, raw thread, compiled packet, agent stdout, unified diffs, and per-criterion scoring — all in one repository. Apply the patch files to the starter; re-run the prompt against your own Claude Code. You should reach the same numbers.

git clone https://github.com/aideps-devs/aideps-evals
cd aideps-evals/2026-05-01-rate-limit-pkt-w6earp
cat SUMMARY.md          # headline numbers
cat scoring.md          # per-criterion analysis
cat runs/c3-packet/diff.patch

04 · Honest caveats

What this eval does not prove.

See it on your own repo.

Pilot a Work Packet on a real task in your codebase. We compile, your agent runs it, you keep all artifacts. If the gain isn’t obvious within the first three packets, we refund the pilot. No call required.
