00 · Evidence
Every eval we publish includes the full methodology, the verbatim agent prompts, the starter codebase, the raw agent stdout, and a unified diff of every line of code each run produced. You can re-run the bench yourself and verify the numbers. We’d rather lose a sale than fudge a claim.
01 · Eval · 2026-05-01
Task: add per-org rate limiting to POST /api/v1/packets in a Hono codebase. Agent: Claude Code 2.1.126, one-shot non-interactive. Variable: only the briefing the agent reads. Identical 50-line starter codebase in all three conditions.
| Briefing | Words | Score | % | Wall-clock |
|---|---|---|---|---|
| Bare one-line prompt | 19 | 2 / 6 | 33% | 55s |
| Raw Slack thread (chat-style paste) | 168 | 5 / 6 | 83% | 1m 15s |
| Aideps-compiled Work Packet (PKT-W6EARP) | 333 | 6 / 6 | 100% | 2m 32s |
Each condition is scored pass / fail against six binary acceptance criteria. The compiled-packet run wrote four unit tests, ran them (all passed), and refactored the server entry point to make it testable. Neither of the other two runs wrote tests.
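For readers unfamiliar with the task, here is a minimal sketch of what "per-org rate limiting" can look like in TypeScript. It is not taken from any of the three runs or from the starter codebase. The fixed-window strategy, the `PerOrgRateLimiter` class, and the limit and window values are all assumptions made for illustration, and the limiter is kept free of Hono imports so it runs on its own.

```typescript
// Hypothetical fixed-window limiter keyed by organization ID.
// None of these names come from the eval repository.
interface WindowState {
  windowStart: number; // epoch ms when the current window began
  count: number; // requests counted in the current window
}

interface RateLimitDecision {
  allowed: boolean;
  remaining: number; // requests left in the current window
  retryAfterSeconds: number; // 0 when the request is allowed
}

class PerOrgRateLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  // `now` is injectable so the limiter can be tested without real time passing.
  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  check(orgId: string): RateLimitDecision {
    const t = this.now();
    let state = this.windows.get(orgId);

    // Start a fresh window on first sight of the org or once the old window has expired.
    if (state === undefined || t - state.windowStart >= this.windowMs) {
      state = { windowStart: t, count: 0 };
      this.windows.set(orgId, state);
    }

    if (state.count >= this.limit) {
      const retryAfterMs = state.windowStart + this.windowMs - t;
      return {
        allowed: false,
        remaining: 0,
        retryAfterSeconds: Math.ceil(retryAfterMs / 1000),
      };
    }

    state.count += 1;
    return {
      allowed: true,
      remaining: this.limit - state.count,
      retryAfterSeconds: 0,
    };
  }
}
```

In a Hono app, a middleware on POST /api/v1/packets would typically call `check` with the caller's org ID and answer with HTTP 429 when `allowed` is false. How the org ID is resolved from the request is left open here, since it depends on the codebase.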
02 · Methodology
The six binary acceptance criteria are derived from the compiled packet itself, so all three conditions are scored against the same target rather than a goalpost moved to favor the packet. Pass / fail is determined by reading the unified diff each agent produced.
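The Score and % columns follow directly from the pass / fail results. A minimal sketch of that arithmetic, assuming percentages are rounded to the nearest whole number (the function name and result shape are hypothetical, not taken from the repository):

```typescript
// Hypothetical helper: turns per-criterion pass/fail results into a score line.
interface ConditionScore {
  passed: number;
  total: number;
  percent: number; // rounded to the nearest whole percent (assumed)
}

function scoreCondition(results: readonly boolean[]): ConditionScore {
  const total = results.length;
  const passed = results.filter((r) => r).length;
  // Guard against an empty criteria list to avoid dividing by zero.
  const percent = total === 0 ? 0 : Math.round((passed / total) * 100);
  return { passed, total, percent };
}

// Only the pass counts below come from the results table; which positions
// hold `true` is illustrative and says nothing about which criteria passed.
const bare = scoreCondition([true, true, false, false, false, false]);
const thread = scoreCondition([true, true, true, true, true, false]);
const packet = scoreCondition([true, true, true, true, true, true]);

console.log(`${bare.passed} / ${bare.total} = ${bare.percent}%`);
console.log(`${thread.passed} / ${thread.total} = ${thread.percent}%`);
console.log(`${packet.passed} / ${packet.total} = ${packet.percent}%`);
```

With six criteria, each one is worth roughly 17 percentage points, so the gap between the raw thread and the compiled packet comes down to a single criterion.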
03 · Reproduce
The manifest, prompts, starter codebase, raw thread, compiled packet, agent stdout, unified diffs, and per-criterion scoring are all in one repository. Apply the patch files to the starter; re-run the prompts against your own Claude Code installation. You should reach the same numbers.
    git clone https://github.com/aideps-devs/aideps-evals
    cd aideps-evals/2026-05-01-rate-limit-pkt-w6earp
    cat SUMMARY.md                 # headline numbers
    cat scoring.md                 # per-criterion analysis
    cat runs/c3-packet/diff.patch
04 · Honest caveats
Pilot a Work Packet on a real task in your codebase. We compile, your agent runs it, you keep all artifacts. If the gain isn’t obvious within the first three packets, we refund the pilot. No call required.