For AI Engineers & Researchers
Report benchmark defects without leaving your terminal.
Major RL benchmarks are riddled with broken tests, incorrect gold answers, and fragile environments, so your agent gets marked wrong on tasks that were never solvable. Delphik makes reporting these defects effortless, and gives you verifiable credit when maintainers fix them.
→ Reading your last run… swe-bench / django__django-16139
→ Gold test looks wrong. Your patch passes but is scored as failing.
→ Report this defect? (press Enter to confirm)
✓ Reported · under review · track at posttrain.dev/benchmarks/swebench-verified
The Problem
Every benchmark has defects. None fully disclose them.
The best benchmarks are actively maintained: SWE-bench and Terminal-Bench ship fixes as defects surface. But the defects keep coming. OpenAI's audit filtered out 68% of the SWE-bench samples it reviewed as underspecified or unfairly tested, and even its curated SWE-bench Verified couldn't stay clean. Detection remains scattered and slow, and reporting a defect means hunting down the right repo or Discord, with no shared place to track the fix.
Sources: OpenAI, SWE-bench Verified (2024); Zhu et al., Establishing Best Practices for Building Rigorous Agentic Benchmarks (2025).
The Loop
One report, tracked all the way to a fixed benchmark
Most defect reports vanish into an issue tracker. The hub follows yours from the moment you file it until the fix actually ships, and credits you when it does.
Report
One command, run attached as evidence
Route
Dedupe, drop noise, send to the repo
Upstream fix
Maintainer merges a PR or links a fix commit
Verify
Confirm merge evidence and affected tasks
Publish
Credit lands and the public health record updates
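The five stages above can be read as a simple forward-only lifecycle. The sketch below is only an illustration of that idea, not the hub's actual implementation: the `Stage` enum, the `DefectReport` class, its fields, and the reporter name `example-user` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names mirroring Report, Route, Upstream fix, Verify, Publish.
    REPORTED = "reported"
    ROUTED = "routed"
    FIXED_UPSTREAM = "fixed_upstream"
    VERIFIED = "verified"
    PUBLISHED = "published"


# Each stage may only advance to the next one in the loop.
NEXT_STAGE = {
    Stage.REPORTED: Stage.ROUTED,
    Stage.ROUTED: Stage.FIXED_UPSTREAM,
    Stage.FIXED_UPSTREAM: Stage.VERIFIED,
    Stage.VERIFIED: Stage.PUBLISHED,
}


@dataclass
class DefectReport:
    benchmark: str
    task_id: str
    reporter: str
    stage: Stage = Stage.REPORTED
    history: list = field(default_factory=list)

    def advance(self, evidence: str) -> Stage:
        """Move to the next stage, recording the evidence that justified the move."""
        if self.stage not in NEXT_STAGE:
            raise ValueError(f"{self.stage.name} is the final stage")
        if not evidence:
            raise ValueError("each transition needs evidence")
        self.history.append((self.stage, evidence))
        self.stage = NEXT_STAGE[self.stage]
        return self.stage

    @property
    def credited(self) -> bool:
        # In this sketch, credit lands only once the report is published.
        return self.stage is Stage.PUBLISHED


report = DefectReport("swe-bench", "django__django-16139", "example-user")
for evidence in ["run attached", "sent to repo", "fix commit linked", "merge confirmed"]:
    report.advance(evidence)
print(report.stage.name, report.credited)
```

Requiring evidence on every transition reflects the idea that a report is only as credible as the run, fix commit, or merge record attached to it.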
Works with your existing tools
Report your first defect
Add the skill to your coding agent, then run /report-defect.
npx skills add delphik-ai/delphik --skill report-defect
/report-defect
For Benchmark Owners
Receive defect reports from the people who use your benchmark
Academic labs and eval vendors: The hub routes verified defect reports from the engineers who actually run your benchmark, straight to your team. Claim your benchmark to receive reports and show the community it's actively maintained.
Continuous auditing
Receive verified defect reports from engineers who actually run your benchmark, routed directly to your team.
Fix faster, fix right
Reporters often attach fix suggestions and trajectories, viewable in our built-in Docent, so you can confirm the root cause and ship the patch fast.
Public health record
Show users the open defect history, current risk, and upstream fix evidence for your benchmark.
The Badge
Wear a quality signal your users actually trust
The Benchmark Health badge shows your benchmark is openly audited by the community, and that you fix what gets found. Drop it in your README to prove your eval is maintained, not just published.
🔴 open defects · 🟡 fixing now · ✅ fixed, straight from the live dataset.
[](https://posttrain.dev/benchmarks/your-benchmark)
Independent
The signal comes from engineers who actually run your benchmark, not a self-reported claim.
Always live
Counts update automatically as defects are reported and fixed. No manual refresh.
Tamper-proof
Rendered from live data on the hub's domain, so the numbers can't be faked or cherry-picked.
Open Data
Every defect, tracked in the open
Every reported defect is tracked in public, from first report through upstream fix. Verified-and-fixed ones become a citable open dataset with evidence, root causes, and fix status. Use it to audit benchmarks, train defect-detection models, or see what makes eval tasks fail. It grows with every contribution.
Start building your track record
Install the skill in your coding agent and report your first benchmark defect. Every verified report becomes permanent proof of your expertise.
Free to use. Sign in with GitHub, install the skill, and every defect you report is credited to you when it's fixed.