Research and insights on AI agent benchmark verification.
A shared, live record of what’s broken, and a way to eval around it
Benchmarks get patched and the target keeps moving. Terminal-Bench 2.1 patched 28 of 89 tasks and every score jumped 6–12 points, but leaderboards never say which version they ran. Delphik closes the whole loop: capture every defect (upstream threads plus /report-defect); audit them into one open store of 465 confirmed defects; surface that store on a badge, the web, and a live CLI; and evaluate around it with version-pinned, task-level scoring.
Benchmark health is a closure loop, not a defect count
Benchmark health isn't a defect count. It's whether discovery closes into a fix. We trace that loop across 1,874 defect threads: what's found, being fixed, fixed, and still open; why fixing never stops; how agentic benchmarks run ~5× denser in defects per task than static ones; and what a fix does to the leaderboard.
744 people are keeping benchmarks in working order, mostly unnamed
We read 6,245 GitHub issue and PR threads across 62 agentic benchmark repos and classified 1,874 as benchmark defects. Behind them are 744 public auditors who quietly keep these benchmarks in working order. This is a shoutout to them, plus the data on who finds what, why a human who hit the task is still the best detector, and why each report costs more than it should.
Report a broken benchmark task in one command
Coding benchmarks are full of broken tasks (wrong oracles, broken verifiers, leaked solutions), and when you hit one, there’s nowhere to report it. Defect Hub lets you report a defect from inside your coding agent in one command, routes it upstream to the maintainer, and tracks it to the fix. Verified defects become an open dataset.
A Meta-Benchmark from Benchmark Version Diffs
We introduce Task Verification Bench, a meta-benchmark that uses benchmark version diffs as ground truth to evaluate whether LLMs can detect defects in coding benchmarks. GPT-5.4 achieves 35–60% root-cause-matched recall: meaningful for triage, insufficient for full automation.
Why Every Run Needs a Trajectory Inspection Layer
Eval scores hide reward hacking, broken tests reject correct solutions, and nobody reads the trajectories. Post-training needs a dedicated inspection layer.