For AI Engineers & Researchers

Report benchmark defects without leaving your terminal.

Major RL benchmarks are riddled with broken tests, incorrect gold answers, and fragile environments, so your agent gets marked wrong on tasks that were never solvable. Delphik makes reporting them effortless, and gives you verifiable credit when maintainers fix them.

Browse the defect hub · Install the skill

terminal

$ /report-defect

→ Reading your last run… swe-bench / django__django-16139

→ Gold test looks wrong. Your patch passes but is scored as failing.

→ Report this defect? (press Enter to confirm)

✓ Reported · under review · track at posttrain.dev/benchmarks/swebench-verified

The Problem

Every benchmark has defects. None fully disclose them.

The best benchmarks are actively maintained. SWE-bench and Terminal-Bench ship fixes as defects surface. But the defects keep coming: OpenAI's audit found 68% of SWE-bench tasks defective, and even its curated SWE-bench Verified couldn't stay clean. Catching them stays scattered and slow. Reporting one means hunting down the right repo or Discord, with no shared place to track the fix.

68%

of SWE-bench tasks came back defective in OpenAI's audit (1,160 of 1,699)

7 of 10

agentic benchmarks fail task + outcome validity (Agentic Benchmark Checklist, ABC, 2025)

10 of 10

fail to fully disclose their own defects (ABC). This is the gap Delphik fills.

Sources: OpenAI, SWE-bench Verified (2024); Zhu et al., Establishing Best Practices for Building Rigorous Agentic Benchmarks (2025).

The Loop

One report, tracked all the way to a fixed benchmark

Most defect reports vanish into an issue tracker. The hub follows yours from the moment you file it until the fix actually ships, and credits you when it does.

1. Report (You): One command, run attached as evidence
2. Route (Hub): Dedupe, drop noise, send to the repo
3. Upstream fix (Maintainer): Maintainer merges a PR or links a fix commit
4. Verify (Hub): Confirm merge evidence and affected tasks
5. Publish (Hub): Credit lands and the public health record updates

The loop closes back to you. Credit for every fix lands permanently on your public profile.
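The five stages above can be read as a small one-way state machine. The sketch below is only an illustration of that idea, not the hub's actual data model: the `Stage` names, the `DefectReport` class, and its fields are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    # Hypothetical stage names that mirror the loop described above.
    REPORTED = "reported"              # You: one command, run attached as evidence
    ROUTED = "routed"                  # Hub: deduped, noise dropped, sent to the repo
    FIXED_UPSTREAM = "fixed_upstream"  # Maintainer: PR merged or fix commit linked
    VERIFIED = "verified"              # Hub: merge evidence and affected tasks confirmed
    PUBLISHED = "published"            # Hub: credit lands, health record updates


# Each stage may only advance to the next one in the loop.
NEXT_STAGE = {
    Stage.REPORTED: Stage.ROUTED,
    Stage.ROUTED: Stage.FIXED_UPSTREAM,
    Stage.FIXED_UPSTREAM: Stage.VERIFIED,
    Stage.VERIFIED: Stage.PUBLISHED,
}


@dataclass
class DefectReport:
    benchmark: str
    task: str
    reporter: str
    stage: Stage = Stage.REPORTED
    history: List[Stage] = field(default_factory=list)

    def advance(self) -> Stage:
        """Move the report one step forward; a published report is final."""
        if self.stage not in NEXT_STAGE:
            raise ValueError("report is already published")
        self.history.append(self.stage)
        self.stage = NEXT_STAGE[self.stage]
        return self.stage

    @property
    def credited(self) -> bool:
        # In this sketch, credit is granted only once the fix is published.
        return self.stage is Stage.PUBLISHED
```

Modeling the loop as a strict sequence makes it impossible, in this sketch, for credit to land before the fix has been verified.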

Works with your existing tools

Claude Code

Cursor

Codex

Harbor

Docent

Report your first defect

Add the skill to your coding agent, then run /report-defect.

1. Install the skill

npx skills add delphik-ai/delphik --skill report-defect

2. Run /report-defect in your agent

/report-defect

View on GitHub

For Benchmark Owners

Receive defect reports from the people who use your benchmark

Academic labs and eval vendors: The hub routes verified defect reports from the engineers who actually run your benchmark, straight to your team. Claim your benchmark to receive reports and show the community it's actively maintained.

Continuous auditing

Receive verified defect reports from engineers who actually run your benchmark, routed directly to your team.

Fix faster, fix right

Reporters often attach fix suggestions and trajectories, viewable in our built-in Docent, so you can confirm the root cause and ship the patch fast.

Public health record

Show users the open defect history, current risk, and upstream fix evidence for your benchmark.

The Badge

Wear a quality signal your users actually trust

The Benchmark Health badge shows your benchmark is openly audited by the community, and that you fix what gets found. Drop it in your README to prove your eval is maintained, not just published.

Claim your benchmark

Live badge

🔴 open defects · 🟡 fixing now · ✅ fixed, straight from the live dataset.

Embed in your README

[![Benchmark Health](https://posttrain.dev/api/benchmarks/your-benchmark/badge.svg)](https://posttrain.dev/benchmarks/your-benchmark)
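If you maintain several benchmarks, the embed line can be generated instead of hand-edited. This is a minimal sketch that assumes the URL pattern shown in the snippet above stays the same for every benchmark; the `badge_markdown` helper is hypothetical and not part of any Delphik tooling.

```python
from urllib.parse import quote

HUB = "https://posttrain.dev"


def badge_markdown(slug: str) -> str:
    """Build the README embed line for one benchmark slug (hypothetical helper)."""
    safe = quote(slug, safe="")  # escape anything that is not URL-safe
    badge_url = f"{HUB}/api/benchmarks/{safe}/badge.svg"
    page_url = f"{HUB}/benchmarks/{safe}"
    return f"[![Benchmark Health]({badge_url})]({page_url})"


if __name__ == "__main__":
    print(badge_markdown("your-benchmark"))
```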

Independent

The signal comes from engineers who actually run your benchmark, not a self-reported claim.

Always live

Counts update automatically as defects are reported and fixed. No manual refresh.

Tamper-proof

Rendered from live data on the hub's domain, so the numbers can't be faked or cherry-picked.

Open Data

Every defect, tracked in the open

Every reported defect is tracked in public, from first report through upstream fix. Verified-and-fixed ones become a citable open dataset with evidence, root causes, and fix status. Use it to audit benchmarks, train defect-detection models, or see what makes eval tasks fail. It grows with every contribution.

110

defect

fixing

474

fixed

Browse the dataset

Delphik defect hub · 632 defects · indexing 77 benchmarks · 22,722 tasks

DeepSWE · defect

narwhals-rolling-window-suite · GitHub

Across the 113-task DeepSWE set, 8 tasks' official solution.patch fails grading: 6 from unpinned dependency drift in the task Dockerfile (newer deps break unrelated base tests under filterwarnings=error; e.g. polars 1.40 in narwhals-rolling-window-suite, confirmed fixed by downgrading to 1.39.3) and 2 where the official solution deterministically fails the hidden new tests (mnamer-daemon-watch-lifecycle, helm-unified-manifest-stream).
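The dependency-drift mechanism in this report can be illustrated with the standard library alone. The sketch below is not the DeepSWE harness: `upgraded_dependency_call` is a hypothetical stand-in for a newer dependency release that starts emitting a deprecation warning, and `warnings.simplefilter("error")` approximates what running a suite with filterwarnings=error does.

```python
import warnings


def upgraded_dependency_call() -> int:
    # Hypothetical: a newer release of a dependency starts warning here,
    # while still returning the same value as before.
    warnings.warn("some_function is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def run_base_test(strict: bool) -> str:
    """Run one unrelated base test with or without warnings-as-errors."""
    with warnings.catch_warnings():
        # "error" turns every warning into a raised exception, similar in
        # spirit to a test suite configured with filterwarnings=error.
        warnings.simplefilter("error" if strict else "ignore")
        try:
            assert upgraded_dependency_call() == 42
        except DeprecationWarning:
            return "failed"
    return "passed"


if __name__ == "__main__":
    print(run_base_test(strict=False))
    print(run_base_test(strict=True))
```

The behavior under test never changes, yet the strict run fails. That is why an unpinned upgrade can break base tests that the task never touches.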

financeagent · defect

common · GitHub

retrieve_information interpolates prompts with str.format(), so ordinary braces in prompts (e.g. JSON output schemas or dictionary examples) are treated as format fields and raise KeyError before the LLM call. The fix replaces this with direct {{key}} substitution.
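The failure mode in this report is easy to reproduce in isolation. The sketch below is not the project's code or its actual patch: the `PROMPT` text and both helper functions are hypothetical, and the single-brace `{question}` placeholder is an assumption made for the illustration.

```python
# Hypothetical prompt: one real placeholder plus a literal JSON example.
PROMPT = 'Answer the question: {question}\nReply as JSON like {"answer": "..."}'


def render_with_format(template: str, **values: str) -> str:
    # Reproduces the defect: str.format treats every brace pair as a field,
    # so the JSON example is parsed as a field named '"answer"'.
    return template.format(**values)


def render_with_replace(template: str, **values: str) -> str:
    # Substitute only the known keys and leave all other braces untouched.
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template


if __name__ == "__main__":
    try:
        render_with_format(PROMPT, question="What is EBITDA?")
    except KeyError as exc:
        print("str.format failed with KeyError:", exc)
    print(render_with_replace(PROMPT, question="What is EBITDA?"))
```

Direct substitution avoids asking prompt authors to escape every literal brace, which is easy to forget when a prompt embeds a JSON schema.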

arc_agi_2 · defect

a6f40cea_0 · GitHub

Public evaluation task a6f40cea has an off-by-one (striped-frame phase) inconsistency in train pair 2's output grid. The maintainer acknowledged it as an unintentional error but has not prioritized a fix because it sits in a train example.

DeepSWE · defect

common · GitHub

DeepSWE grants reward only when both the base test run and the new test run exit 0, but 69/113 tasks collect base tests via directory wildcards (pytest tests/, go test ./..., cargo test) and the reset step restores only the files touched by test.patch. A model that solves the task can still score 0 if it leaves behind any extra or broken test that the wildcard collects, which is a false negative. Proposal: ship FAIL_TO_PASS/PASS_TO_PASS lists and grade via log parsing (SWE-bench style).
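The difference between the two grading rules in this report can be sketched in a few lines. This is an illustration under assumptions, not DeepSWE's or SWE-bench's grader: the `results` dictionary stands in for per-test statuses already parsed from a test log, and the test names are made up.

```python
from typing import Dict, List


def grade_by_exit_code(results: Dict[str, str]) -> bool:
    # A wildcard run exits 0 only if every collected test passed,
    # including tests the task never asked about.
    return all(status == "passed" for status in results.values())


def grade_by_lists(
    results: Dict[str, str],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    # Only the named tests count; anything else the wildcard collected is ignored.
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results.get(test) == "passed" for test in required)


if __name__ == "__main__":
    # Hypothetical run: the model solved the task but left a scratch test behind.
    results = {
        "tests/test_feature.py::test_new_behavior": "passed",
        "tests/test_core.py::test_existing": "passed",
        "tests/test_scratch.py::test_tmp": "failed",
    }
    fail_to_pass = ["tests/test_feature.py::test_new_behavior"]
    pass_to_pass = ["tests/test_core.py::test_existing"]
    print(grade_by_exit_code(results))
    print(grade_by_lists(results, fail_to_pass, pass_to_pass))
```

A missing test is treated as a failure in `grade_by_lists`, so deleting a required test does not count as passing it.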

DeepSWE · defect

quill-shared-toolbar-focus · GitHub

The quill-shared-toolbar-focus grader asserts getAttribute('aria-disabled') is exactly the string 'false' after re-enabling the editor, rejecting the standards-compliant approach of removing the attribute (which yields null). A correct, accessibility-friendly solution fails the restore test.

FrontierSWE · defect

notebook-compression · GitHub

Task notebook-compression cannot be set up: instruction.md expects the data at $DATA_ROOT/visible, but that directory does not exist in the task's Docker environment.


Start building your track record

Install the skill in your coding agent and report your first benchmark defect. Every verified report becomes permanent proof of your expertise.

Install the Skill · Browse benchmarks

Free to use. Sign in with GitHub, install the skill, and every report you file is credited to you when it's fixed.