Report and fix defects in AI benchmarks
Nearly every major coding benchmark ships with task defects — broken verifiers, wrong oracles, solution leakage. They corrupt the scores everyone relies on. Nobody is systematically managing them.
Delphik turns benchmark defects into public evidence, upstream routing, and live health records. Coding benchmarks first.
RL benchmarks are broken
RL benchmarks contain systematic defects, so the training signal is corrupted at the source. Verification is never finished, because stronger models keep surfacing new defects. Every benchmark needs a permanent verification layer.
Defects in every major benchmark
SWE-bench, Terminal-Bench, Atlas, Pro, Rebench v2 — every one shows broken verifiers, wrong oracles, solution leakage.
Eval scores hide reward hacking
Agents learn to game broken verifiers instead of solving the task. The score goes up; the capability doesn't.
Manual verification doesn’t scale
Each task takes hours: unfamiliar repos, test logic, oracle validation. That doesn't scale to thousands of tasks.
Stronger agents surface new defects
Tasks that passed yesterday's model break under today's. Defects that were invisible become unavoidable.
Defects surfaced two ways.
Two paths on the same backbone. The Hub captures defects on public benchmarks organically, as the research community reports them; future private audit programs can use the same methodology, validator team, and evidence infrastructure for closed datasets.
Delphik Defect Hub
Anyone reports a defect from their terminal. We triage and route upstream.
Private Audit Programs
Future private audits for teams that need dedicated routing, review, and release evidence.
Root-cause matching
Same defect definition across both products. We require a pinpointed root cause, not just a 'broken' verdict.
Same validator team
Our team makes the final call on confirmed records, validating community reports on the Hub and future private audit evidence.
Shared audit infrastructure
Sandboxed execution, evidence-review tooling, lifecycle dashboards. One platform under both products.
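To make the root-cause requirement above concrete, here is a minimal sketch of a defect record that rejects a bare 'broken' verdict. The class and field names are hypothetical and are not Delphik's actual schema; the three defect kinds are the ones named on this page.

```python
from dataclasses import dataclass
from enum import Enum


class DefectKind(Enum):
    # The three defect kinds named on this page.
    BROKEN_VERIFIER = "broken verifier"
    WRONG_ORACLE = "wrong oracle"
    SOLUTION_LEAKAGE = "solution leakage"


@dataclass(frozen=True)
class DefectReport:
    """Hypothetical record shape; every field name here is an assumption."""
    benchmark: str
    task_id: str
    kind: DefectKind
    root_cause: str  # a pinpointed explanation, not just a verdict

    def __post_init__(self) -> None:
        cause = self.root_cause.strip().lower()
        # Reject empty explanations and the bare verdict "broken".
        if cause in ("", "broken"):
            raise ValueError("a defect report needs a pinpointed root cause")


report = DefectReport(
    benchmark="example-bench",  # illustrative name, not a real benchmark
    task_id="task-001",
    kind=DefectKind.WRONG_ORACLE,
    root_cause="reference output hardcodes a value that depends on the run date",
)
print(report.kind.value, "->", report.root_cause)
```

Validating at construction time means an incomplete report cannot enter the store at all, which is one simple way to hold both products to the same definition.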
Blog
- 6 min read | Product | The Defect Loop: Capture, Audit, Surface, Re-eval
Benchmarks get patched and the target keeps moving. Terminal-Bench 2.1 patched 28 of 89 tasks and every score jumped 6–12 points, but leaderboards never say which version they ran. Delphik closes the whole loop: capture every defect (upstream threads + /report-defect), audit it into one open store of 465 confirmed defects, surface it on a badge, the web, and a live CLI, and eval around it with version-pinned, task-level scoring.
- 7 min read | Data | A Benchmark With No Reported Defects Isn't Clean. It's Unaudited.
Benchmark health isn't a defect count. It's whether discovery closes into a fix. We trace that loop across 1,874 defect threads: what's found, fixing, fixed, and still open; why fixing never stops; how agentic benchmarks run ~5× denser per task than static ones; and what a fix does to the leaderboard.
- 6 min read | Community, Data | The Unsung Heroes Fixing Agentic Benchmarks
We read 6,245 GitHub Issue/PR threads across 62 agentic benchmark repos and classified 1,874 as benchmark defects. Behind them are 744 public auditors who quietly keep these benchmarks in working order. A shoutout to them, and the data on who finds what, why a human who hit the task is still the best detector, and why each report costs more than it should.
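The version-pinned, task-level scoring mentioned in the first post above can be sketched roughly as follows. This is an illustration under stated assumptions, not Delphik's implementation: the benchmark name, version string, task IDs, and the shape of the defect store are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    passed: bool


def score(results, defective_ids):
    """Return (raw_pass_rate, clean_pass_rate).

    clean_pass_rate ignores tasks with a confirmed defect and is None
    when no task remains after filtering.
    """
    if not results:
        raise ValueError("no task results to score")
    raw = sum(r.passed for r in results) / len(results)
    clean = [r for r in results if r.task_id not in defective_ids]
    clean_rate = sum(r.passed for r in clean) / len(clean) if clean else None
    return raw, clean_rate


# Hypothetical defect store, keyed by (benchmark, version) so a score is
# always tied to the exact benchmark release it was measured on.
known_defects = {("example-bench", "1.0"): {"t2"}}

run_benchmark, run_version = "example-bench", "1.0"
run_results = [
    TaskResult("t1", True),
    TaskResult("t2", False),  # failed on a task with a confirmed defect
    TaskResult("t3", True),
    TaskResult("t4", False),
]

defects = known_defects.get((run_benchmark, run_version), set())
raw, clean = score(run_results, defects)
print(f"{run_benchmark}@{run_version}: raw={raw:.3f} clean={clean:.3f}")
```

Keying the defect set by version matters because a patched release changes which tasks count as defective, so two scores are only comparable when both the version and the excluded tasks are reported alongside them.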
Talk to Delphik about your benchmark
If you build or use AI benchmarks and care about their quality, we’d love to talk. The open hub is live; private audits are a separate future program.