Waitlist open

Bounty hunting, for AI benchmarks.

HackerOne for RL Envs. Reviewers get paid to find defects in RL benchmarks: wrong oracles, broken verifiers, leaky tests. Teams get their datasets audited before they ship.

Waitlist only for now. One email when paid audits open, and you get first dibs. Researchers can also contribute free on the open track.

For reviewers

Get paid to find defects.

Read the verifier, prove what's broken, write it up. The first reviewer to pinpoint the real root cause takes the bounty. No ML PhD required. Bug bounty, code review, or competitive-judging instincts map cleanly.

  1. Pick a task

    Browse the open paid queue. Each item is a single broken-or-not call on one benchmark task: its instruction, repo snapshot, and verifier.

  2. Audit independently

    Open it in VS Code with our devcontainer. Read the verifier, run it, and submit your verdict with Markdown evidence: file, line, before/after.

  3. Get paid

    A validator confirms the root cause. The first reviewer to nail it takes the bounty. Payouts go out weekly via Wise.
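
As a hedged illustration of what a finding might look like, the Python sketch below pairs a deliberately defective toy verifier with a minimal evidence record. Every name in it (`lenient_verifier`, `strict_verifier`, `Finding`, `verifier.py`) is hypothetical and is not taken from the platform or any real benchmark; the actual submission format may differ.

```python
from dataclasses import dataclass
from typing import Callable

Verifier = Callable[[str, str], bool]


def lenient_verifier(output: str, expected: str) -> bool:
    # Hypothetical defect: a substring check passes any output that merely
    # mentions the expected answer, even when the final answer is wrong.
    return expected in output


def strict_verifier(output: str, expected: str) -> bool:
    # One possible fix: compare the whole answer after trimming whitespace.
    return output.strip() == expected.strip()


def accepts_wrong_output(verifier: Verifier, wrong_output: str, expected: str) -> bool:
    # A verifier that passes a known-wrong output is defective.
    return verifier(wrong_output, expected)


@dataclass
class Finding:
    # Hypothetical evidence record: verdict plus file, line, before/after.
    broken: bool
    file: str
    line: int
    before: str
    after: str

    def to_markdown(self) -> str:
        verdict = "broken" if self.broken else "not broken"
        return "\n".join([
            f"Verdict: {verdict}",
            f"Location: {self.file}:{self.line}",
            f"Before: `{self.before}`",
            f"After: `{self.after}`",
        ])


wrong = "The answer is not 42, it is 7"  # known-wrong output for expected "42"
finding = Finding(
    broken=accepts_wrong_output(lenient_verifier, wrong, "42"),
    file="verifier.py",  # hypothetical path
    line=12,             # hypothetical line number
    before="return expected in output",
    after="return output.strip() == expected.strip()",
)
print(finding.to_markdown())
```

Run against `strict_verifier`, the same wrong output is rejected, which is the kind of before/after contrast a write-up can show.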

For teams shipping benchmarks

Audit your dataset before you ship.

Whether it's a private dataset you're selling to a frontier lab or an open benchmark you're about to publish, we run it through the same crowd-plus-AI audit first, so the defects surface before your users hit them.

Catch defects pre-release

Surface broken verifiers and wrong oracles before launch. The payoff is biggest on a freshly built benchmark, where defect rates run highest.

Private or open

Selling private data to a lab, or publishing in the open? Works either way. For an open release, ship the audit alongside it as a quality signal.

A report you can stand behind

Get a defect report with root causes and fixes you can cite at release, not a black-box score.

Get on the list before paid audits open.

Drop your email and we'll reach out the day before paid audits open, so the waitlist gets first dibs. Have a dataset to audit instead? Talk to us.