Why "Delphik"?
At Delphi, the ancient Greeks built a temple to Apollo — the god of reason, light, and order. Before entering, visitors read the inscription carved into stone:
“Know thyself.”
The temple was built on the very spot where Apollo slew Python, the serpent of chaos. In Nietzsche’s framing, this is the eternal tension: the Apollonian drive for clarity against the Dionysian force of instinct and disorder.
AI agents today are remarkably Dionysian. They act on instinct, take unpredictable paths, and sometimes fake success — suppressing test outputs, editing files they shouldn’t touch, or gaming the metric while missing the point entirely.
Delphik is the Apollonian response. We treat benchmark quality as the lever, not the goal: we find defects in benchmark tasks and fix them. Honest tasks make for honest training signals; honest signals make for agents that behave the way you’d expect. The mission is agent predictability.
Does this benchmark actually measure what it claims to?
Built by someone who’s done this before
Jongwon Park
Founder
Reinforcement learning engineer. Built RL game bots at Krafton (PUBG). Managed 300+ data labelers at a startup, building training pipelines and quality assurance from scratch. Personally audited 200+ tasks across SWE-bench, Terminal-Bench, SWE-bench Atlas, SWE-bench Pro, and SWE-Rebench v2. Knows both the ML side and the human operations side.