Blog

Research and insights on AI agent benchmark verification.

6 min read · Product

The Defect Loop: Capture, Audit, Surface, Re-eval

A shared, live record of what’s broken, and a way to eval around it

Benchmarks get patched and the target keeps moving. Terminal-Bench 2.1 patched 28 of 89 tasks and every score jumped 6–12 points, but leaderboards never say which version they ran. Delphik closes the whole loop: capture every defect (upstream threads + /report-defect), audit it into one open store of 465 confirmed defects, surface it on a badge, the web, and a live CLI, and eval around it with version-pinned, task-level scoring.
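To make "version-pinned, task-level scoring" concrete, here is a minimal sketch, not Delphik's actual implementation. The names `TaskResult` and `pinned_score` and the sample task IDs are hypothetical. The idea is to score only results from one named benchmark version and to skip tasks with a confirmed defect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt at one task (hypothetical record shape)."""
    task_id: str
    benchmark_version: str
    passed: bool


def pinned_score(results, version, defective_task_ids):
    """Pass rate over results from `version`, skipping known-defective tasks."""
    kept = [
        r for r in results
        if r.benchmark_version == version and r.task_id not in defective_task_ids
    ]
    if not kept:
        raise ValueError("no scorable tasks for this version")
    return sum(r.passed for r in kept) / len(kept)


# Illustrative data only: four results on version "2.1", one on "2.0".
results = [
    TaskResult("t1", "2.1", True),
    TaskResult("t2", "2.1", False),
    TaskResult("t3", "2.1", True),   # treated as defective below, so excluded
    TaskResult("t4", "2.1", True),
    TaskResult("t1", "2.0", False),  # different version, so excluded
]
score = pinned_score(results, version="2.1", defective_task_ids={"t3"})
print(f"{score:.3f}")
```

Because the score is computed per task and tied to a named version, two runs are comparable only when they share both the version and the exclusion set.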

7 min read · Data

A Benchmark With No Reported Defects Isn't Clean. It's Unaudited.

Benchmark health is a closure loop, not a defect count

Benchmark health isn't a defect count. It's whether discovery closes into a fix. We trace that loop across 1,874 defect threads: what's found, fixing, fixed, and still open; why fixing never stops; how agentic benchmarks run ~5× denser per task than static ones; and what a fix does to the leaderboard.

6 min read · Community · Data

The Unsung Heroes Fixing Agentic Benchmarks

744 people are keeping benchmarks in working order, mostly unnamed

We read 6,245 GitHub Issue/PR threads across 62 agentic benchmark repos and classified 1,874 as benchmark defects. Behind them are 744 public auditors who quietly keep these benchmarks in working order. A shoutout to them, and the data on who finds what, why a human who hit the task is still the best detector, and why each report costs more than it should.

3 min read · Launch

Introducing Defect Hub

Report a broken benchmark task in one command

Coding benchmarks are full of broken tasks (wrong oracles, broken verifiers, leaked solutions), and when you hit one, there’s nowhere to report it. Defect Hub lets you report a defect from inside your coding agent in one command, routes it upstream to the maintainer, and tracks it to the fix. Verified defects become an open dataset.

5 min read · Research · ICML 2026

Can LLMs Detect Benchmark Defects?

A Meta-Benchmark from Benchmark Version Diffs

We introduce Task Verification Bench, a meta-benchmark that uses benchmark version diffs as ground truth to evaluate whether LLMs can detect defects in coding benchmarks. GPT-5.4 achieves 35–60% root-cause-matched recall: meaningful for triage, insufficient for full automation.
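One way to read "root-cause-matched recall" is sketched below. This is an illustration under assumptions, not the paper's scoring code: the function name is hypothetical, and treating a match as exact equality of root-cause labels is an assumption. A known defect counts as found only when the model flags the task and names the same root cause.

```python
def root_cause_matched_recall(ground_truth, predictions):
    """Fraction of known defects flagged with the matching root cause.

    ground_truth: task_id -> root-cause label for every known defect
    predictions:  task_id -> root-cause label for every task the model flagged
    """
    if not ground_truth:
        raise ValueError("ground truth is empty")
    hits = sum(
        1 for task_id, cause in ground_truth.items()
        if predictions.get(task_id) == cause
    )
    return hits / len(ground_truth)


# Illustrative data only.
truth = {
    "a": "wrong_oracle",
    "b": "broken_verifier",
    "c": "leaked_solution",
    "d": "wrong_oracle",
}
preds = {
    "a": "wrong_oracle",     # flagged with the right cause: a hit
    "b": "leaked_solution",  # flagged with the wrong cause: not a hit
    "e": "wrong_oracle",     # not a known defect: does not affect recall
}
recall = root_cause_matched_recall(truth, preds)
print(recall)
```

Under this reading, flagging a defective task for the wrong reason earns no credit, which makes the metric stricter than plain detection recall.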

10 min read · Opinion

Post-Training Is Flying Blind

Why Every Run Needs a Trajectory Inspection Layer

Eval scores hide reward hacking, broken tests reject correct solutions, and nobody reads the trajectories. Post-training needs a dedicated inspection layer.