Can LLMs Detect Benchmark Defects? A Meta-Benchmark from Benchmark Version Diffs

The Problem

Coding agent benchmarks are the foundation of AI progress in software engineering. They drive training signal for RL, provide evaluation metrics for model releases, and shape research priorities. But these benchmarks themselves contain bugs.

SWE-bench: OpenAI audited 1,699 tasks with 3 annotators each. 68% were flagged as defective: underspecified problem statements, tests that reject valid solutions, and other issues. This led to SWE-bench Verified (500 curated tasks).
Terminal-Bench 2: 31% of tasks were revised via community and author fix PRs, covering weak tests, misleading artifacts, and environment issues.
Terminal-Bench 1: 108 community fix PRs covering 129 defect entries across 229 tasks as of May 2026.

These defects persist not for lack of effort but because manual auditing doesn't scale, and stronger agents keep surfacing new defects. When frontier models saturated SWE-bench Verified, auditing the remaining failures revealed that 59% stemmed from flawed tests, not model limitations.

Task Verification Bench

We introduce Task Verification Bench (TVB), a meta-benchmark that repurposes benchmark version diffs as ground truth. When benchmark authors release v2, they implicitly label each modified task as “defective in v1.”
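
As a minimal sketch of this labeling idea, assume each benchmark version is available as a mapping from task ID to its artifacts. That representation, the function name, and the example tasks are hypothetical and are not the TVB release format. A task becomes a defect candidate when the authors changed it between versions:

```python
from typing import Dict, Mapping

# Hypothetical representation: task_id -> {artifact name -> content}.
Task = Mapping[str, str]


def label_from_version_diff(v1: Dict[str, Task], v2: Dict[str, Task]) -> Dict[str, str]:
    """Label each v1 task by whether its artifacts were modified in v2."""
    labels = {}
    for task_id, old in v1.items():
        new = v2.get(task_id)
        if new is None:
            # Task dropped in v2: the reason is unknown, so leave it unlabeled.
            continue
        labels[task_id] = "defect" if dict(old) != dict(new) else "valid"
    return labels


# Toy data, invented for illustration only.
v1 = {
    "task-a": {"instruction": "Sort the file.", "tests": "assert out == sorted(inp)"},
    "task-b": {"instruction": "Count lines.", "tests": "assert n == 3"},
}
v2 = {
    "task-a": {"instruction": "Sort the file.", "tests": "assert out == sorted(inp)"},
    "task-b": {"instruction": "Count non-empty lines.", "tests": "assert n == 3"},
}
print(label_from_version_diff(v1, v2))  # {'task-a': 'valid', 'task-b': 'defect'}
```

A raw diff only yields candidate labels; any further filtering of those labels is not modeled in this sketch.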

We collected labels from three benchmark lineages:

SWE-bench: 1,160 defect tasks from the 3-annotator audit (after filtering: 129 defect + 129 valid = 258 tasks)
Terminal-Bench 2: 26 defect + 26 valid = 52 tasks
Terminal-Bench 1: 65 defect + 56 valid = 121 tasks (after ground-truth noise filtering)

Figure: Ground truth comes from author-acknowledged version diffs: SWE-bench's 3-annotator notes (left) and Terminal-Bench's description + fix_summary (right).

Given only the original task artifacts (problem statement, test suite, gold patch), the LLM agent must determine whether each task contains a defect and identify the correct root cause.

Root-Cause Matching

A key design choice: we evaluate with root-cause matching, not just binary detection.

An agent that flags a task as defective for the wrong reason provides no actionable signal for fixing it. We therefore require the agent's identified defect to match the ground-truth (GT) annotation: an LLM judge compares the agent's notes against the GT annotation and asks whether fixing the agent's identified defect would also fix the GT defect.
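
The judge step can be sketched as a prompt builder plus a yes/no parser. Only the core question comes from the description above; the rest of the prompt wording, the function names, and the `ask_llm` callable (a stand-in for whatever LLM client is used) are assumptions for illustration.

```python
from typing import Callable

# Hypothetical prompt wording; only the final question mirrors the text above.
JUDGE_TEMPLATE = """You are comparing two descriptions of a defect in a benchmark task.

Ground-truth annotation:
{ground_truth}

Agent's notes:
{agent_notes}

Question: would fixing the defect the agent identified also fix the ground-truth defect?
Answer with exactly one word: YES or NO."""


def build_judge_prompt(agent_notes: str, ground_truth: str) -> str:
    """Fill the judge template with the two defect descriptions."""
    return JUDGE_TEMPLATE.format(
        ground_truth=ground_truth.strip(),
        agent_notes=agent_notes.strip(),
    )


def root_cause_matches(
    agent_notes: str, ground_truth: str, ask_llm: Callable[[str], str]
) -> bool:
    """Return True when the judge's reply starts with YES (case-insensitive)."""
    reply = ask_llm(build_judge_prompt(agent_notes, ground_truth))
    return reply.strip().upper().startswith("YES")


# Stub judge for demonstration; a real run would call an LLM here.
print(
    root_cause_matches(
        "Tests assert a value the problem statement never specifies.",
        "Test checks behavior not described in the issue.",
        ask_llm=lambda prompt: "YES",
    )
)  # True
```

Keeping the LLM call behind a plain callable makes the parsing logic testable without network access.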

This distinction matters: verdict-level recall overstates root-cause-matched detection by up to 54 percentage points (Terminal-Bench 2: 88.5% verdict recall vs. 34.6% root-cause recall).
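
To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-task results. The `Result` record and its field names are hypothetical, and the four example rows are toy data, not TVB results.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Result:
    is_defect: bool         # ground-truth label
    flagged: bool           # agent's verdict
    root_cause_match: bool  # judge's decision; only counted when flagged


def recalls(results: Iterable[Result]) -> Tuple[float, float]:
    """Return (verdict recall, root-cause recall) over ground-truth defects."""
    defects = [r for r in results if r.is_defect]
    if not defects:
        raise ValueError("no ground-truth defects to score")
    verdict_hits = sum(r.flagged for r in defects)
    # A root-cause hit requires both a defect verdict and a judge match.
    rc_hits = sum(r.flagged and r.root_cause_match for r in defects)
    return verdict_hits / len(defects), rc_hits / len(defects)


toy = [
    Result(is_defect=True, flagged=True, root_cause_match=True),
    Result(is_defect=True, flagged=True, root_cause_match=False),
    Result(is_defect=True, flagged=True, root_cause_match=False),
    Result(is_defect=True, flagged=False, root_cause_match=False),
    Result(is_defect=False, flagged=False, root_cause_match=False),
]
v, rc = recalls(toy)
print(f"verdict recall {v:.0%}, RC recall {rc:.0%}, gap {100 * (v - rc):.0f}pp")
# verdict recall 75%, RC recall 25%, gap 50pp
```

Because every root-cause hit is also a verdict hit, root-cause recall can never exceed verdict recall; the gap counts the tasks flagged for the wrong reason.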

Figure: The TVB pipeline. Given only the v1 task, the agent returns a verdict and a root cause; an LLM judge asks whether fixing the agent's defect would also fix the ground-truth defect.

Key Results

Cross-Benchmark Baseline (GPT-5.4, verify-only with sandbox)

Metric	SWE-bench	Terminal-Bench 2 (TB2)	Terminal-Bench 1 (TB1)
Verdict recall	62.0%	88.5%	90.8%
Root-cause (RC) recall	59.7%	34.6%	41.5%
Verdict→RC gap	2pp	54pp	49pp

35–60% root-cause-matched recall: meaningful for triage, but far from fully automated auditing.
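
As a quick arithmetic check, the gap row and the 35–60% range follow directly from the reported recall figures. The dictionary below just restates the table; the variable and function names are ours.

```python
# (verdict recall %, root-cause recall %) as reported in the table above.
reported = {
    "SWE-bench": (62.0, 59.7),
    "TB2": (88.5, 34.6),
    "TB1": (90.8, 41.5),
}


def gap_pp(verdict: float, rc: float) -> float:
    """Verdict-to-root-cause gap in percentage points, to one decimal."""
    return round(verdict - rc, 1)


for name, (verdict, rc) in reported.items():
    print(f"{name}: gap = {gap_pp(verdict, rc):.1f} pp")
# SWE-bench: gap = 2.3 pp
# TB2: gap = 53.9 pp
# TB1: gap = 49.3 pp

rc_values = [rc for _, rc in reported.values()]
print(f"RC recall range: {min(rc_values)}% to {max(rc_values)}%")
# RC recall range: 34.6% to 59.7%
```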

Figure: Verdict-level recall vs. root-cause recall across SWE-bench, Terminal-Bench 2, and Terminal-Bench 1. Verdict-level recall looks high, but root-cause recall is what counts. On the interactive benchmarks the gap runs 49–54 points.

Self-Attribution Anchoring

A pipeline ablation on SWE-bench uncovered a surprising effect. When GPT-5.4 first attempts a task and then verifies it with sandbox access to its own failed code, it tends to blame itself rather than the benchmark, and RC recall drops from 78.3% to 53.9% (a 24.4-point drop).

We call this self-attribution anchoring. The agent reviews its failed patch, attributes the test failure to its own implementation, and dismisses defects it would otherwise flag.
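
One way to picture the ablation is as a switch on what the verifier gets to see. The sketch below is an assumption about how such conditions might be encoded: the `Condition` fields, the condition names, and the prompt section headings are all hypothetical and are not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Condition:
    name: str
    solve_first: bool       # does the agent attempt the task before verifying?
    show_own_attempt: bool  # can the verifier see that failed attempt?


def verifier_context(task_artifacts: str, own_attempt: Optional[str], cond: Condition) -> str:
    """Assemble the text shown to the verifier under a given condition."""
    parts = ["## Task artifacts", task_artifacts]
    if cond.solve_first and cond.show_own_attempt and own_attempt is not None:
        # This section is what could anchor the agent on its own mistakes.
        parts += ["## Your earlier attempt (failed tests)", own_attempt]
    parts.append("## Question\nDoes this task contain a defect? If so, state the root cause.")
    return "\n\n".join(parts)


verify_only = Condition("verify-only", solve_first=False, show_own_attempt=False)
solve_then_verify = Condition("solve-then-verify", solve_first=True, show_own_attempt=True)

artifacts = "problem statement + test suite + gold patch"
attempt = "diff --git a/app.py b/app.py ..."
print(verifier_context(artifacts, attempt, verify_only))
print("-----")
print(verifier_context(artifacts, attempt, solve_then_verify))
```

Under this framing, the two conditions differ only in whether the failed attempt appears in the verifier's context, which isolates the anchoring effect from the rest of the pipeline.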

Gemini 3.1 Pro shows no such drop, holding RC recall at 70–76% across all conditions. In this ablation, model choice mattered more than pipeline design.

Figure: Root-cause recall by pipeline condition for GPT-5.4 vs. Gemini 3.1 Pro. Letting GPT-5.4 solve the task first and verify with code access drags its root-cause recall down (78.3 → 53.9); Gemini 3.1 Pro stays flat across every condition.

Text-Layer vs Infrastructure-Layer

The agent excels at text-layer defects, comparing problem statements against test assertions to find explicit mismatches. But it exhibits an infrastructure-layer blind spot: environment issues, Dockerfile leakage, and oracle regressions require longer reasoning chains that the agent consistently fails to follow.

All 6 sampled true-positive cases involve text-layer defects, while all 8 Terminal-Bench miss/mismatch cases involve infrastructure-layer defects.

Implications

LLM-assisted auditing can meaningfully reduce human review effort for text-layer defects. But infrastructure-layer verification remains a challenge requiring human expertise.

The optimal workflow is agent-first triage followed by targeted human review. LLMs prioritize where humans should look, rather than replacing them.
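
A minimal sketch of such a triage queue follows. The `TriageItem` fields and the priority rule are our assumptions, not a description of any shipped system. In particular, ranking unflagged tasks with environment artifacts ahead of other unflagged tasks is one possible way to compensate for the infrastructure-layer blind spot described earlier.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TriageItem:
    task_id: str
    flagged: bool            # agent's defect verdict
    has_env_artifacts: bool  # hypothetical flag, e.g. the task ships a Dockerfile


def review_order(items: List[TriageItem]) -> List[str]:
    """Order tasks for human review: flagged first, then infra-heavy, then the rest."""

    def priority(item: TriageItem) -> int:
        if item.flagged:
            return 0
        if item.has_env_artifacts:
            return 1
        return 2

    # sorted() is stable, so ties keep their original order.
    return [item.task_id for item in sorted(items, key=priority)]


queue = [
    TriageItem("t1", flagged=False, has_env_artifacts=False),
    TriageItem("t2", flagged=True, has_env_artifacts=False),
    TriageItem("t3", flagged=False, has_env_artifacts=True),
    TriageItem("t4", flagged=True, has_env_artifacts=True),
]
print(review_order(queue))  # ['t2', 't4', 't3', 't1']
```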

This is the workflow we're building at Delphik.

Read the Paper

“Can LLMs Detect Benchmark Defects? A Meta-Benchmark from Benchmark Version Diffs”
Accepted to the DL4C Workshop @ ICML 2026.

Task Verification Bench dataset and evaluation code will be released upon publication.

About the Author

Jongwon Park is the founder of Delphik, building eval infrastructure for AI coding agents. Previously an RL research engineer at Krafton (PUBG) and AI lead managing 300+ data labelers.

LinkedIn · posttrain.dev