Round 1 · 7-day experiment · ends Jun 25
$1,000 for the broken tasks in your benchmarks.
A bug bounty for RL benchmarks. Your agent got marked wrong on a task that was never solvable? You already have the trajectory. Report it, get $20–40. Seven benchmarks, one pool, seven days.
- DeepSWE · $20 / defect
- SWE-bench Pro · $20 / defect
- SWE-Atlas · $20 / defect
- HIL-Bench · $20 / defect
- APEX-SWE · $20 / defect
- FrontierSWE · $40 / defect
- SWE-Marathon · $40 / defect
FrontierSWE and SWE-Marathon pay more — newer benchmarks, less picked-over.
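For a rough sense of how far the pool stretches, here is a minimal sketch of the arithmetic in Python. The rates and the $1,000 pool come from the list above; the function name and the dictionary layout are illustrative only, and the sketch says nothing about what happens once the pool is spent, since the rules don't state it.

```python
# Per-defect rates (USD), copied from the list above.
RATES = {
    "DeepSWE": 20,
    "SWE-bench Pro": 20,
    "SWE-Atlas": 20,
    "HIL-Bench": 20,
    "APEX-SWE": 20,
    "FrontierSWE": 40,
    "SWE-Marathon": 40,
}
POOL = 1000  # one shared pool for the round (USD)


def total_payout(verified: dict[str, int]) -> int:
    """Sum payouts for a mapping of benchmark name -> verified defect count.

    Hypothetical helper for illustration; not part of any real tooling.
    """
    return sum(RATES[name] * count for name, count in verified.items())


# How many verified defects the pool can cover at the two rates.
max_defects = POOL // min(RATES.values())  # every defect paid at $20
min_defects = POOL // max(RATES.values())  # every defect paid at $40
```

So the pool covers somewhere between 25 and 50 verified defects, depending on the mix of benchmarks.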
A broken task — not a clever agent.
We pay for defects in the benchmark, shown by a trajectory. Three shapes:
false_pass · False pass
The task's tests accept an invalid or incomplete solution — a hackable or under-specified grader. Your trajectory shows a passing run that never actually solved the task.
false_fail · False fail
A correct solution is graded as failing — a wrong or too-strict test, a gold patch that fails its own verifier, a flaky check. Your trajectory shows a valid solution scored zero.
broken_env · Broken environment
The task can't be built or graded as shipped — unpinned dependency drift, a missing data file, a dead external URL, a wrong base image.
The fine print
- 01 Already-known defects don't count (e.g. the SWE-bench Pro defects in the DeepSWE blog).
- 02 Already-registered defects don't count. Check the benchmark on the hub before you file.
- 03 Attach an evidence trajectory. Reason-only or non-reproducible reports don't pay.
- 04 Payout is manual, over X DM, after we review. One round, ends Jun 25.
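Before you file, it can help to run your report against rules 01 to 03 yourself. The sketch below is one way to model that in Python. Only the three defect kinds (false_pass, false_fail, broken_env) come from this page; the `DefectReport` fields and the `rejection_reasons` helper are hypothetical and do not reflect the real report schema or review process.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DefectKind(Enum):
    """The three defect shapes named on this page."""
    FALSE_PASS = "false_pass"
    FALSE_FAIL = "false_fail"
    BROKEN_ENV = "broken_env"


@dataclass
class DefectReport:
    """Hypothetical report record; field names are assumptions."""
    benchmark: str
    task_id: str
    kind: DefectKind
    trajectory_path: Optional[str] = None  # evidence trajectory, if attached
    already_known: bool = False            # rule 01: publicly documented
    already_registered: bool = False       # rule 02: already on the hub


def rejection_reasons(report: DefectReport) -> list[str]:
    """Return the fine-print rules a report would fail (empty if none)."""
    reasons = []
    if report.already_known:
        reasons.append("defect is already known")
    if report.already_registered:
        reasons.append("defect is already registered on the hub")
    if not report.trajectory_path:
        reasons.append("no evidence trajectory attached")
    return reasons
```

An empty list only means the report clears these three checks; the manual review still decides whether it pays.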
Claim your first one
If you've run any of these benchmarks, you've probably already hit a broken task.
npx skills add delphik-ai/delphik --skill report-defect
/report-defect
The skill reads your last run, matches the benchmark + task, and files the defect with your trajectory attached. That trajectory is the evidence we pay against.
Round 2 ships a bigger pool if this works.
This is an experiment: does paying for verified defects move the needle on RL data quality? Help us find out.
$1,000 pool · $20–40 per verified defect · ends Jun 25