Round 1 · 7-day experiment · ends Jun 25

$1,000 for the broken tasks in your benchmarks.

A bug bounty for RL benchmarks. Your agent got marked wrong on a task that was never solvable? You already have the trajectory. Report it, get $20–40. Seven benchmarks, one pool, seven days.

bounty.ledger
  • DeepSWE: $20 / defect
  • SWE-bench Pro: $20 / defect
  • SWE-Atlas: $20 / defect
  • HIL-Bench: $20 / defect
  • APEX-SWE: $20 / defect
  • FrontierSWE: $40 / defect
  • SWE-Marathon: $40 / defect
Pool: $1,000

FrontierSWE and SWE-Marathon pay more — newer benchmarks, less picked-over.
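
For a sense of scale, the ledger arithmetic can be sketched as below. The rates and pool size are taken from the ledger; the names `RATES`, `POOL`, and `payouts_covered` are illustrative only, and how the pool is handled once it runs low is not stated on this page.

```python
# Per-defect rates in USD, copied from the ledger; POOL is the round's total.
RATES = {
    "DeepSWE": 20,
    "SWE-bench Pro": 20,
    "SWE-Atlas": 20,
    "HIL-Bench": 20,
    "APEX-SWE": 20,
    "FrontierSWE": 40,
    "SWE-Marathon": 40,
}
POOL = 1_000


def payouts_covered(rate: int, pool: int = POOL) -> int:
    """Number of whole payouts at `rate` that fit in `pool`."""
    return pool // rate


most = payouts_covered(min(RATES.values()))    # every defect at the $20 rate
fewest = payouts_covered(max(RATES.values()))  # every defect at the $40 rate
print(f"The pool covers between {fewest} and {most} verified defects.")
```

Depending on the mix of benchmarks, the pool pays for somewhere between 25 and 50 verified defects.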

A broken task — not a clever agent.

We pay for defects in the benchmark, shown by a trajectory. Three shapes:

false_pass

False pass

The task's tests accept an invalid or incomplete solution — a hackable or under-specified grader. Your trajectory shows a passing run that never actually solved the task.

false_fail

False fail

A correct solution is graded as failing — a wrong or too-strict test, a gold patch that fails its own verifier, a flaky check. Your trajectory shows a valid solution scored zero.

broken_env

Broken environment

The task can't be built or graded as shipped — unpinned dependency drift, a missing data file, a dead external URL, a wrong base image.
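
One way to think about the three shapes is as a small decision over what the grader said versus what actually happened. The sketch below is only an illustration: `RunOutcome`, its fields, and `classify` are hypothetical names, not part of any benchmark or of the reporting skill, and `actually_solved` stands for your own judgement backed by the trajectory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DefectKind(Enum):
    FALSE_PASS = "false_pass"
    FALSE_FAIL = "false_fail"
    BROKEN_ENV = "broken_env"


@dataclass
class RunOutcome:
    env_ok: bool           # the task built and the grader ran as shipped
    graded_pass: bool      # what the benchmark's tests reported
    actually_solved: bool  # your judgement of the solution (hypothetical field)


def classify(run: RunOutcome) -> Optional[DefectKind]:
    """Map a run to one of the three defect shapes, or None if there is no defect."""
    if not run.env_ok:
        return DefectKind.BROKEN_ENV
    if run.graded_pass and not run.actually_solved:
        return DefectKind.FALSE_PASS
    if run.actually_solved and not run.graded_pass:
        return DefectKind.FALSE_FAIL
    return None  # grader and reality agree


# A valid solution that was scored zero is a false fail.
print(classify(RunOutcome(env_ok=True, graded_pass=False, actually_solved=True)))
```

A run where the agent simply failed a solvable task falls through to `None`: that is a wrong agent, not a broken task.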

The fine print

  • 01 Already-known defects don't count (e.g. the SWE-bench Pro defects in the DeepSWE blog).
  • 02 Already-registered defects don't count. Check the benchmark on the hub before you file.
  • 03 Attach an evidence trajectory. Reason-only or non-reproducible reports don't pay.
  • 04 Payout is manual over X DM after we review. One round, ends Jun 25.
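
Read as a pre-filing checklist, the first three rules might look like the sketch below. The `Report` fields and `blocking_reasons` are hypothetical and only restate the fine print; the actual review is manual.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    already_known: bool       # e.g. publicly documented before this round
    already_registered: bool  # already listed for the benchmark on the hub
    has_trajectory: bool      # an evidence trajectory is attached
    reproducible: bool        # the defect can be reproduced from the report


def blocking_reasons(report: Report) -> List[str]:
    """Return the fine-print rules a report fails; an empty list means none."""
    reasons = []
    if report.already_known:
        reasons.append("defect is already known")
    if report.already_registered:
        reasons.append("defect is already registered on the hub")
    if not report.has_trajectory:
        reasons.append("no evidence trajectory attached")
    if not report.reproducible:
        reasons.append("report is not reproducible")
    return reasons


# A fresh, reproducible defect with a trajectory attached has nothing blocking it.
print(blocking_reasons(Report(False, False, True, True)))
```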

Claim your first one

If you've run any of these benchmarks, you've probably already hit a broken task.

1. Install the skill in your coding agent
npx skills add delphik-ai/delphik --skill report-defect
2. Run it on the run that got marked wrong
/report-defect

The skill reads your last run, matches the benchmark + task, and files the defect with your trajectory attached. That trajectory is the evidence we pay against.

Round 2 ships a bigger pool if this works.

This is an experiment: does paying for verified defects measurably improve RL data quality? Help us find out.

$1,000 pool · $20–40 per verified defect · ends Jun 25