For AI Engineers & Researchers

Report benchmark defects without leaving your terminal.

Major RL benchmarks are riddled with broken tests, incorrect gold answers, and fragile environments, so your agent gets marked wrong on tasks that were never solvable. Delphik makes reporting these defects effortless and gives you verifiable credit when maintainers fix them.

terminal
$ /report-defect

Reading your last run… swe-bench / django__django-16139

Gold test looks wrong. Your patch passes but is scored as failing.

Report this defect? (press Enter to confirm)

Reported · under review · track at posttrain.dev/benchmarks/swebench-verified

The Problem

Every benchmark has defects. None fully disclose them.

The best benchmarks are actively maintained: SWE-bench and Terminal-Bench ship fixes as defects surface. But the defects keep coming. OpenAI's audit found 68% of the SWE-bench tasks it reviewed defective, and even its curated SWE-bench Verified couldn't stay clean. Catching defects remains scattered and slow: reporting one means hunting down the right repo or Discord, with no shared place to track the fix.

68%
of SWE-bench tasks came back defective in OpenAI's audit (1,160 of 1,699)
7 of 10
agentic benchmarks fail task + outcome validity (Agentic Benchmark Checklist, ABC, 2025)
10 of 10
fail to fully disclose their own defects (ABC). This is the gap Delphik fills.

Sources: OpenAI, SWE-bench Verified (2024); Zhu et al., Establishing Best Practices for Building Rigorous Agentic Benchmarks (2025).
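
As a quick sanity check, the headline 68% follows directly from the counts quoted above (1,160 defective out of 1,699 audited). The variable names below are illustrative only.

```python
# Counts as quoted above from OpenAI's audit.
defective_tasks = 1_160
audited_tasks = 1_699

defect_rate = defective_tasks / audited_tasks
print(f"{defect_rate:.1%}")  # 68.3%
```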

The Loop

One report, tracked all the way to a fixed benchmark

Most defect reports vanish into an issue tracker. The hub follows yours from the moment you file it until the fix actually ships, and credits you when it does.

01 Report (You)
One command, run attached as evidence

02 Route (Hub)
Dedupe, drop noise, send to the repo

03 Upstream fix (Maintainer)
Maintainer merges a PR or links a fix commit

04 Verify (Hub)
Confirm merge evidence and affected tasks

05 Publish (Hub)
Credit lands and the public health record updates

The loop closes back to you. Credit for every fix lands permanently on your public profile.
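
One way to picture the loop is as a simple linear state machine. The sketch below is only an illustration of the five stages and their owners as listed above. The `Stage` enum, `OWNER` table, and helper functions are hypothetical names and do not describe the hub's actual implementation.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The five stages named in the loop (hypothetical identifiers)."""
    REPORT = 1
    ROUTE = 2
    UPSTREAM_FIX = 3
    VERIFY = 4
    PUBLISH = 5


# Who acts at each stage, as labeled in the loop.
OWNER = {
    Stage.REPORT: "You",
    Stage.ROUTE: "Hub",
    Stage.UPSTREAM_FIX: "Maintainer",
    Stage.VERIFY: "Hub",
    Stage.PUBLISH: "Hub",
}


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the following stage, or None once the report is published."""
    if stage is Stage.PUBLISH:
        return None
    return Stage(stage.value + 1)


def walk(start: Stage = Stage.REPORT) -> list:
    """List every remaining step as 'owner: stage' strings."""
    steps = []
    stage: Optional[Stage] = start
    while stage is not None:
        steps.append(f"{OWNER[stage]}: {stage.name}")
        stage = next_stage(stage)
    return steps


print(walk())
```

Modeling the stages as an ordered enum keeps the "who acts next" question a single lookup, which is the property the loop is meant to convey: a report never sits in a stage without a named owner.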

Works with your existing tools

Claude Code
Cursor
Codex
Harbor
Docent

Report your first defect

Add the skill to your coding agent, then run /report-defect.

1. Install the skill
npx skills add delphik-ai/delphik --skill report-defect
2. Run /report-defect in your agent
/report-defect

For Benchmark Owners

Receive defect reports from the people who use your benchmark

Academic labs and eval vendors: the hub routes verified defect reports from the engineers who actually run your benchmark, straight to your team. Claim your benchmark to receive reports and show the community it's actively maintained.

Continuous auditing

Receive verified defect reports from engineers who actually run your benchmark, routed directly to your team.

Fix faster, fix right

Reporters often attach fix suggestions and trajectories, viewable in our built-in Docent, so you can confirm the root cause and ship the patch fast.

Public health record

Show users the open defect history, current risk, and upstream fix evidence for your benchmark.

The Badge

Wear a quality signal your users actually trust

The Benchmark Health badge shows your benchmark is openly audited by the community, and that you fix what gets found. Drop it in your README to prove your eval is maintained, not just published.

Benchmark Health badge (live)

🔴 open defects · 🟡 fixing now · ✅ fixed, straight from the live dataset.

Embed in your README
[![Benchmark Health](https://posttrain.dev/api/benchmarks/your-benchmark/badge.svg)](https://posttrain.dev/benchmarks/your-benchmark)
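
If you maintain several benchmarks, the embed line can be generated rather than hand-edited. This is a minimal sketch that assumes the badge URL and the benchmark page share the same slug, as the snippet above suggests. `badge_markdown` is a hypothetical helper, not part of any Delphik tooling.

```python
BASE_URL = "https://posttrain.dev"


def badge_markdown(slug: str) -> str:
    """Build the README embed line for a benchmark slug."""
    badge = f"{BASE_URL}/api/benchmarks/{slug}/badge.svg"
    page = f"{BASE_URL}/benchmarks/{slug}"
    return f"[![Benchmark Health]({badge})]({page})"


# Slug taken from the tracking URL shown in the terminal example.
print(badge_markdown("swebench-verified"))
```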

Independent

The signal comes from engineers who actually run your benchmark, not a self-reported claim.

Always live

Counts update automatically as defects are reported and fixed. No manual refresh.

Tamper-proof

Rendered from live data on the hub's domain, so the numbers can't be faked or cherry-picked.

Open Data

Every defect, tracked in the open

Every reported defect is tracked in public, from first report through upstream fix. Verified-and-fixed ones become a citable open dataset with evidence, root causes, and fix status. Use it to audit benchmarks, train defect-detection models, or see what makes eval tasks fail. It grows with every contribution.

110
defect
48
fixing
474
fixed
Delphik defect hub: 632 defects · indexing 77 benchmarks · 22,722 tasks
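
To give a feel for working with the open dataset, here is a rough sketch of tallying records by status. The `DefectRecord` fields and the sample rows are assumptions made for illustration, and the real schema may differ. Only the three status labels and the published totals come from the figures above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DefectRecord:
    # Hypothetical fields; the real dataset schema may differ.
    benchmark: str
    task_id: str
    status: str      # one of "defect", "fixing", "fixed"
    root_cause: str


def tally(records: list) -> Counter:
    """Count records per status."""
    return Counter(r.status for r in records)


# Made-up sample rows, for illustration only.
sample = [
    DefectRecord("example-bench", "task-001", "fixed", "incorrect gold answer"),
    DefectRecord("example-bench", "task-002", "fixing", "broken test"),
    DefectRecord("example-bench", "task-003", "fixed", "fragile environment"),
]
counts = tally(sample)
print(counts["fixed"], counts["fixing"], counts["defect"])  # 2 1 0

# The published per-status counts add up to the published total.
published = {"defect": 110, "fixing": 48, "fixed": 474}
print(sum(published.values()))  # 632
```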

Start building your track record

Install the skill in your coding agent and report your first benchmark defect. Every verified report becomes permanent proof of your expertise.

Free to use. Sign in with GitHub, install the skill, and every report you file is credited to you when it's fixed.