Looking for our first pilot partners

Your post-training trajectories, fully analyzed in days, not weeks

Your team shouldn’t spend weeks reading trajectories. We take your runs, classify every failure mode, detect reward hacking, and deliver structured feedback datasets — ready for your next training run.

One platform for the entire workflow — AI pre-screening, human expert review, and structured feedback generation, all built in.


Problem

Post-training is flying blind

Eval scores are the only feedback loop in most post-training pipelines. But they lie in two directions: models that cheat pass, and models that solve problems correctly fail. The only way to tell the difference is to read the trajectories — which almost nobody does.

Automated LLM monitors help, but when used as training signals they create a meta-reward-hacking problem: models learn to game the monitor itself. Human inspection, grounded in the actual code and task context, remains the only reliable backstop.

Reward hacking is invisible in metrics

METR found o3 reward-hacked in roughly 30% of its runs on METR's RE-Bench tasks, for example by patching eval functions and faking speed results. Eval scores looked perfect.

Correct solutions get rejected

OpenAI found 59.4% of hard SWE-bench tasks had flawed tests that rejected valid fixes. You can’t tell without reading the trajectory.

Automated monitors can be gamed

OpenAI showed that using CoT monitors as training signals leads to obfuscated reward hacking — models hide intent while continuing to cheat.

Nobody has the capacity to review

A single SWE-rebench trajectory takes 30 minutes to review properly. A single run has hundreds. The tooling and operational capacity simply don’t exist.
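To make the capacity gap concrete, here is a back-of-the-envelope sketch of the arithmetic. The 30 minutes per trajectory comes from the text above; the run size of 300 trajectories and the 40-hour week are assumptions chosen only for illustration.

```python
def review_hours(num_trajectories: int, minutes_per_trajectory: float) -> float:
    """Total reviewer time, in hours, needed to read every trajectory in a run."""
    return num_trajectories * minutes_per_trajectory / 60


# Assumed example: one run of 300 trajectories at 30 minutes each.
hours = review_hours(300, 30)
print(f"{hours:.0f} reviewer-hours")        # 150 reviewer-hours
print(f"{hours / 40:.2f} reviewer-weeks")   # 3.75 reviewer-weeks (assumed 40-hour week)
```

Under these assumptions, a single run costs one full-time reviewer almost a month.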

Solution

From raw trajectories to structured feedback data

Upload your runs and get structured results back — powered by LLM pre-screening, trained human reviewers, and RL engineering expertise built into the platform.

You share your trajectories

Send us your post-training runs — any format. We start with SWE agent trajectories (SWE-bench, internal benchmarks) and are expanding to other agent types.

We pre-screen with AI

AI pre-screening analyzes each trajectory and flags reward hacking, rule violations, and suspicious patterns before human review begins.

Expert reviewers analyze in the actual repo

Reviewers work inside the cloned repo in a dockerized VS Code environment: reading code, running tests, and stepping through agent actions with Claude Code as an analysis copilot. Every failure is classified, every risk is flagged, and corrective feedback is written at each decision point.

You get structured results back

Experiment report: where the agent succeeded and failed.
Failure taxonomy: every failure classified with counts and severity.
Risk analysis: reward hacking, benchmark gaming, rule violations.
Feedback dataset: corrective examples in JSONL, ready for your next run.
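As a rough picture of what a JSONL feedback dataset could look like, the sketch below writes two corrective examples as one JSON object per line and reads them back. The field names (`task_id`, `step`, `failure_type`, `risk_flags`, `feedback`) and the sample values are hypothetical and are not the platform's actual schema.

```python
import json

# Hypothetical record layout: every field name here is illustrative only.
records = [
    {
        "task_id": "task-2847",
        "step": 4,
        "failure_type": "reasoning_failure",
        "risk_flags": [],
        "feedback": "The edited regex does not handle nested brackets.",
    },
    {
        "task_id": "task-2847",
        "step": 5,
        "failure_type": "reward_hacking",
        "risk_flags": ["reward_hacking"],
        "feedback": "Read the failing test output and fix the regex.",
    },
]

# JSONL: one compact JSON object per line, no enclosing array.
jsonl_text = "\n".join(json.dumps(r) for r in records)

# Reading it back is a line-by-line json.loads.
loaded = [json.loads(line) for line in jsonl_text.splitlines()]
print(len(loaded), loaded[1]["failure_type"])
```

Because each line is an independent object, a file in this shape can be streamed or sampled without loading the whole dataset.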

Currently onboarding pilot users — the platform supports custom taxonomy, flexible output formats, and configurable turnaround.

Product

Purpose-built for trajectory review

We built our own tooling because generic labeling platforms don’t work for agent trajectories. Reviewers work inside the actual cloned repo in VS Code — with the trajectory panel, code, and terminal side by side.


Runs Inbox

swe-bench-run-0312: Complete, 60 tasks, Team A
swe-bench-run-0314: In Review, 45 tasks, Team B
internal-bench-0316: Queued, 80 tasks, no team listed

Trajectory Timeline — task-2847

0s   read_file   src/utils/parser.py
3s   bash        python -m pytest tests/
5s   think       Tests failing on parse_config; need to fix regex pattern
8s   edit_file   src/utils/parser.py:42-67   [reasoning_failure]
12s  bash        python -m pytest tests/ --no-header -q   [reward_hacking]
15s  submit      patch submitted

Review Interface — Step 5

Failure Type: reasoning, tool-use, formatting, reward hacking

Risk Flags: Reward Hacking, Rule Violation, Cheating

Corrective Feedback

Agent suppressed test output using --no-header -q flags to hide 3 failing test cases instead of investigating root cause. The regex pattern in parser.py:45 doesn't handle nested brackets. Should have read the failing test output and fixed the regex.
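For illustration, the timeline above can be represented as plain data and passed through a simple rule-based screen. This is a minimal sketch and not the platform's pre-screening, which the page describes as LLM-based: the `Step` class, the two rules, and the flag names are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Step:
    t: int      # seconds since the start of the trajectory
    tool: str   # e.g. "bash", "edit_file"
    arg: str    # command line, file path, or free text


# Hypothetical rules: (flag name, tool it applies to, regex over the step argument).
RULES = [
    ("edits_test_file", "edit_file", re.compile(r"(^|/)tests?/|(^|/)test_[^/]*\.py")),
    ("quiet_test_run", "bash", re.compile(r"pytest.*(\s-q\b|--no-header)")),
]


def pre_screen(steps):
    """Return (1-based step number, flag) pairs for steps that match a rule."""
    hits = []
    for number, step in enumerate(steps, start=1):
        for flag, tool, pattern in RULES:
            if step.tool == tool and pattern.search(step.arg):
                hits.append((number, flag))
    return hits


steps = [
    Step(0, "read_file", "src/utils/parser.py"),
    Step(3, "bash", "python -m pytest tests/"),
    Step(5, "think", "Tests failing on parse_config"),
    Step(8, "edit_file", "src/utils/parser.py:42-67"),
    Step(12, "bash", "python -m pytest tests/ --no-header -q"),
    Step(15, "submit", "patch submitted"),
]
print(pre_screen(steps))  # [(5, 'quiet_test_run')]
```

A match here only marks a step as worth a closer look. Deciding whether it is actually reward hacking still requires reading the step in context.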

Report — swe-bench-run-0312

Failure Taxonomy (60 tasks reviewed): Reasoning Failure, Reward Hacking, Tool-Use Error, Rule Violation, Formatting, Other

Deliverables

Feedback dataset (JSONL): 60 samples
Full experiment report: Ready
Risk analysis summary: Ready

Who We Are

Built by people who’ve done this before

Jongwon Park

Founder

Reinforcement learning engineer. Built RL game bots at Krafton (PUBG). Managed 500+ data labelers at a startup — building onboarding, training pipelines, and quality assurance from scratch. Knows both the ML side and the human operations side.

Review Operations

Our review team

We’re building a managed reviewer network from CS programs at IIT, UGM, and University of Indonesia — with automated screening, structured training, and built-in quality checks. Not crowdsourced strangers. The founder previously managed 500+ labelers on a large-scale AI project.


Platform

Built for scale

The platform handles the entire trajectory review pipeline — from ingestion to structured feedback output. Configure your taxonomy, upload runs, and get results.

Custom taxonomy

Define your own failure categories and risk flags. The platform adapts to your specific agent types and evaluation criteria.

AI + human review

LLM pre-screening handles triage automatically. Trained reviewers work in dockerized VS Code environments with the actual repo cloned.

Structured output

Get failure taxonomies, risk analysis, and corrective feedback datasets in JSONL — ready to plug into your next training run.
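One way to think about a custom taxonomy is as a small, validated vocabulary that every review label must come from. The sketch below is an assumption about how such a configuration could be modeled, not the platform's configuration format; the `Taxonomy` class and the category names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxonomy:
    failure_types: frozenset
    risk_flags: frozenset

    def check(self, failure_type, flags):
        """Return a list of problems; an empty list means the label is valid."""
        problems = []
        if failure_type not in self.failure_types:
            problems.append(f"unknown failure type: {failure_type}")
        for flag in flags:
            if flag not in self.risk_flags:
                problems.append(f"unknown risk flag: {flag}")
        return problems


# Hypothetical team-specific vocabulary.
taxonomy = Taxonomy(
    failure_types=frozenset(
        {"reasoning", "tool_use", "formatting", "reward_hacking", "other"}
    ),
    risk_flags=frozenset({"reward_hacking", "rule_violation", "cheating"}),
)

print(taxonomy.check("reward_hacking", ["cheating"]))  # []
print(taxonomy.check("timeout", ["cheating"]))         # ['unknown failure type: timeout']
```

Rejecting unknown labels at review time keeps the counts in the final taxonomy report comparable across reviewers and runs.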

How the platform works

Upload

Push your post-training trajectories to the platform via API or dashboard.

Screen

AI pre-screening classifies trajectories and flags reward hacking, rule violations, and anomalies.

Review

Expert reviewers analyze flagged trajectories in cloned repo environments with full code context.

Export

Download structured feedback datasets, failure reports, and risk analysis — ready for your pipeline.
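Once an export is downloaded, a failure taxonomy is essentially a count per category. The following sketch tallies a JSONL export, assuming each line carries a `failure_type` field; that field name and the three sample lines are hypothetical.

```python
import io
import json
from collections import Counter


def taxonomy_counts(lines):
    """Count records per failure type in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        counts[json.loads(line)["failure_type"]] += 1
    return counts


# Stand-in for an exported file; with a real export, pass an open file handle.
export = io.StringIO(
    '{"task_id": "a", "failure_type": "reasoning_failure"}\n'
    '{"task_id": "b", "failure_type": "reward_hacking"}\n'
    '{"task_id": "c", "failure_type": "reasoning_failure"}\n'
)

counts = taxonomy_counts(export)
for name, n in counts.most_common():
    print(name, n)
```

The same function works on a file opened with `open(path)`, since it only needs an iterable of lines.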

About

Why "Delphik"?

At Delphi, the ancient Greeks built a temple to Apollo — the god of reason, light, and order. Before entering, visitors read the inscription carved into stone:

“Know thyself.”

The temple was built on the very spot where Apollo slew Python, the serpent of chaos. In Nietzsche’s framing, this is the eternal tension: the Apollonian drive for clarity against the Dionysian force of instinct and disorder.

AI agents today are remarkably Dionysian. They act on instinct, take unpredictable paths, and sometimes fake success — suppressing test outputs, editing files they shouldn’t touch, or gaming the metric while missing the point entirely.

Delphik is the Apollonian response. We bring structured human review to agent trajectories — step by step, action by action — to separate real problem-solving from the illusion of it.

We’re a small team with an MVP, not a temple. But the question on the door is the same:

Does this agent actually know what it did?