May 28, 2026 · 3 min read · Launch

Introducing Defect Hub

Report a broken benchmark task in one command, and see it through to the fix.

You're running an eval and your agent gets marked wrong on a task it never could have solved: a broken verifier, a wrong answer key, a leaked solution. You know it's the benchmark's fault. So where do you report it? Today, nowhere that matters.

That gap is everywhere. Coding benchmarks decide which models ship and shape what gets trained next, yet they're full of defects: OpenAI's audit for SWE-bench Verified filtered out 68% of the SWE-bench tasks it reviewed for underspecified issues, unfair tests, or other problems, and the Agentic Benchmark Checklist (2025) found most benchmarks fail task or outcome validity, with none fully disclosing their own defects. Maintainers can't catch everything, and every stronger agent surfaces new defects. But the people who hit them have nowhere good to report: a note in a Discord, an issue in the wrong repo, and it dies.

Defect Hub closes that loop. It lives where you already are, inside your coding agent. Install the skill once, then report in one line:

npx skills add delphik-ai/delphik --skill report-defect
/report-defect

The trajectory that exposed the bug is attached as evidence automatically, so you never leave your workflow. Add a fix suggestion and we tee it up as a ready-to-merge PR, so the maintainer's lift is near zero. There's no payout. The motivation is simpler: you're annoyed, and you want it fixed. A bug bounty for AI benchmarks, minus the bounty.

From there every report is triaged, routed upstream to the maintainer as a PR or issue, and tracked through the fix: found → fixing → fixed, credited to you when it lands. The verified, fixed defects become a citable open dataset. None of this reinvents the stack: benchmark execution runs on Harbor and trajectories render in Docent. Harbor measures whether the agent solves the task; Defect Hub measures whether the task is even correct.
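
To make the found → fixing → fixed lifecycle concrete, here is a minimal Python sketch of a report moving through those three states. It is purely illustrative: the DefectReport class, its fields, and the advance method are hypothetical names invented for this example, not Defect Hub's actual data model or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    FOUND = "found"
    FIXING = "fixing"
    FIXED = "fixed"


# Allowed forward transitions: found -> fixing -> fixed.
# This mapping is an assumption for illustration only.
NEXT_STATUS = {
    Status.FOUND: Status.FIXING,
    Status.FIXING: Status.FIXED,
}


@dataclass
class DefectReport:
    """Hypothetical record for one reported benchmark defect."""

    benchmark: str
    task_id: str
    reporter: str
    status: Status = Status.FOUND
    history: list = field(default_factory=list)

    def advance(self) -> Status:
        """Move the report one step forward; raise if it is already fixed."""
        if self.status not in NEXT_STATUS:
            raise ValueError(f"report is already {self.status.value}")
        self.history.append(self.status)  # remember the state we are leaving
        self.status = NEXT_STATUS[self.status]
        return self.status


# Placeholder values, not real benchmark or task identifiers.
report = DefectReport(benchmark="example-bench", task_id="task-001", reporter="alice")
report.advance()  # found -> fixing
report.advance()  # fixing -> fixed
print(report.status.value)
```

Keeping the transitions in a single mapping means a report can only move forward one step at a time, which is one simple way to keep a public status trail honest.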

What's live today:

72 benchmarks and 22,322 tasks indexed and ready to audit. Browse them at posttrain.dev/benchmarks.
Maintainers can claim a benchmark to get reports routed straight to them and keep a public health record current.

Hit a broken task today? Install the skill and run /report-defect, or browse the indexed benchmarks at posttrain.dev/benchmarks.