Open benchmark defect hub

Report and fix defects in AI benchmarks

Nearly every major coding benchmark ships with task defects — broken verifiers, wrong oracles, solution leakage. They corrupt the scores everyone relies on. Nobody is systematically managing them.

Delphik turns benchmark defects into public evidence, upstream routing, and live health records. Coding benchmarks first.

Problem

RL benchmarks are broken

RL benchmarks contain systematic defects, so the training signal is corrupted at the source. Verification is never finished, because stronger models keep surfacing new defects. Every benchmark needs a permanent verification layer.

Defects in every major benchmark

SWE-bench, Terminal-Bench, Atlas, Pro, Rebench v2 — every one shows broken verifiers, wrong oracles, solution leakage.

Eval scores hide reward hacking

Agents learn to game broken verifiers instead of solving the task. The score goes up; the capability doesn't.

Manual verification doesn’t scale

Verifying a single task takes hours: learning an unfamiliar repo, reading the test logic, validating the oracle. That doesn't scale to thousands of tasks.

Stronger agents surface new defects

Tasks that passed yesterday's model break under today's. Defects that were invisible become unavoidable.

New Research

Can LLMs Detect Benchmark Defects? A Meta-Benchmark from Version Diffs

We tested whether LLMs can automatically find defects in coding benchmarks. GPT-5.4 achieves 35–60% root-cause-matched recall. Accepted to ICML 2026.

Solution

Defects surfaced two ways.

Two paths on the same backbone. The Hub collects defects in public benchmarks as the research community surfaces them; future private audit programs can use the same methodology, validator team, and evidence infrastructure for closed datasets.

What both share

01

Root-cause matching

Same defect definition on both products. We require a pinpointed root cause, not just a 'broken' verdict.
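To make the difference between a pinpointed root cause and a bare 'broken' verdict concrete, here is a minimal sketch of how root-cause matching could be scored. Everything in it is an assumption for illustration, not Delphik's actual schema or scoring code: the `DefectReport` type, the `ROOT_CAUSES` labels (borrowed from the defect types named on this page), the function names, and the task IDs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical labels, taken from the defect types named on this page.
ROOT_CAUSES = {"broken_verifier", "wrong_oracle", "solution_leakage"}


@dataclass(frozen=True)
class DefectReport:
    task_id: str
    root_cause: Optional[str]  # None models a bare "broken" verdict


def is_root_cause_match(report: DefectReport, confirmed: DefectReport) -> bool:
    """A report counts only if it names the same task AND the same root cause."""
    return (
        report.root_cause in ROOT_CAUSES
        and report.task_id == confirmed.task_id
        and report.root_cause == confirmed.root_cause
    )


def root_cause_matched_recall(
    reports: Sequence[DefectReport], confirmed: Sequence[DefectReport]
) -> float:
    """Fraction of confirmed defects that some report matched by root cause."""
    if not confirmed:
        return 0.0
    hits = sum(any(is_root_cause_match(r, c) for r in reports) for c in confirmed)
    return hits / len(confirmed)


# Illustrative data only: two confirmed defects, two incoming reports.
confirmed = [
    DefectReport("task-1", "broken_verifier"),
    DefectReport("task-2", "solution_leakage"),
]
reports = [
    DefectReport("task-1", "broken_verifier"),  # pinpointed root cause: counts
    DefectReport("task-2", None),               # bare "broken" verdict: does not
]
print(root_cause_matched_recall(reports, confirmed))  # 0.5
```

Under this assumed definition, flagging the right task without naming the right cause earns no credit, which is why the second report does not raise the score.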

02

Same validator team

Our team makes the final call on confirmed records, validating community reports on the Hub and future private audit evidence.

03

Shared audit infrastructure

Sandboxed execution, evidence-review tooling, lifecycle dashboards. One platform under both products.

Get in touch

Talk to Delphik about your benchmark

If you build or use AI benchmarks and care about their quality, we’d love to talk. The open hub is live; private audits are a separate future program.