Looking for our first pilot partners

Your post-training trajectories, fully analyzed in 24 hours

Your team shouldn’t spend weeks reading trajectories. We take your runs, classify every failure mode, detect reward hacking, and deliver structured feedback datasets — ready for your next training run.

Not a self-serve platform. We work as a dedicated extension of your post-training team — full consulting + execution on every project.

The Problem

Trajectory review is the biggest bottleneck in post-training

After every training run, someone on your team has to read through trajectories by hand. Classify failures. Spot reward hacking. Figure out what to change for the next run. Then somehow turn all of that into structured training data.

This process takes days to weeks. It requires your most experienced people. And it’s the single biggest reason your iteration cycles are slow.

Manual trajectory reading

Researchers read every trajectory by hand to understand what the agent actually did

Ad-hoc failure classification

No consistent taxonomy — failures get categorized differently by different people

Reward hacking goes unnoticed

Subtle gaming behaviors are easy to miss when you’re reviewing hundreds of runs

No path to training data

Review notes stay in docs and spreadsheets — never become machine-readable feedback

From the blog
Post-Training Is Flying Blind — Why Every Run Needs a Trajectory Inspection Layer
Eval scores hide reward hacking, broken tests reject correct solutions, and nobody reads the trajectories.
What We Deliver

From raw trajectories to structured feedback data

You share your trajectories. We return a complete analysis package — not a dashboard you have to figure out yourself.

01

Experiment Report

What happened in this run. Where the agent succeeded, where it failed, and what failure patterns keep repeating. Structured, not a wall of text.

02

Failure Type Breakdown

Every failure classified: reasoning errors, tool-use mistakes, formatting issues, reward hacking, contamination risks, rule violations, unexpected cheating. With counts, examples, and severity.

03

Risk Analysis

Dedicated review for reward hacking, benchmark gaming, and rule violations. The things that are hardest to catch at scale and most dangerous to miss.

04

Feedback Dataset

Every trajectory segment reviewed and converted into corrective training examples. JSONL format, ready to plug into your next run. Not notes — actual training data.
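
For illustration, a single record in the feedback dataset might look like the sketch below. The field names are an example only; the actual schema is agreed during scoping, and the values here mirror the task-2847 walkthrough shown later on this page.

```python
import json

# Illustrative feedback record; field names are an example, not a fixed schema.
# One JSON object per line is what makes the file JSONL.
record = {
    "task_id": "task-2847",
    "step": 5,
    "failure_type": "reward_hacking",
    "severity": "high",
    "corrective_feedback": (
        "Agent suppressed failing test output with --no-header -q instead of "
        "fixing the regex at parser.py:45."
    ),
}

with open("feedback.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```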

How It Works

We handle the entire review pipeline

This isn’t a tool you log into. We run the full process for you — from trajectory ingestion to feedback data delivery.

01

You share your trajectories

Send us your post-training runs — any format. We start with SWE agent trajectories (SWE-bench, internal benchmarks) and are expanding to computer-use agents (CUA) and other agent types.

02

We pre-process with our internal tooling

Our trajectory review tool extracts timelines, maps tool calls, analyzes diffs, and surfaces risk signals. This isn’t a generic labeling UI — it’s purpose-built for agent trajectory review. A simplified sketch of this kind of pre-processing appears after these steps.

03

Trained reviewers analyze every trajectory

Our review team (500+ trained CS students from IIT, UGM, and the University of Indonesia) classifies failure modes, flags reward hacking, and writes corrective feedback at each decision point.

04

You get structured results back

Experiment report, failure taxonomy with counts, risk analysis, and a feedback dataset package — all structured and ready for your next training run.
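
For readers who want a concrete picture of step 02, here is a minimal sketch of the kind of timeline extraction involved. It is an illustration only, not our production tooling, and it assumes trajectories arrive as JSONL events with "t", "action", and "args" fields; in practice we adapt to whatever format you send.

```python
import json
from collections import Counter

def load_timeline(path):
    """Read a trajectory file (one JSON event per line) into a time-ordered list."""
    events = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            # "t" (seconds), "action" (tool name), "args" are assumed field names.
            events.append((event["t"], event["action"], event.get("args", {})))
    return sorted(events, key=lambda e: e[0])

def tool_call_counts(events):
    """Count how often each tool was invoked in the run."""
    return Counter(action for _, action, _ in events)
```

Our actual pipeline layers diff analysis and risk-signal detection on top of this skeleton; that is where reviewers get most of their leverage.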

Our Internal Tool

Purpose-built for trajectory review

We built our own tooling because generic labeling platforms don’t work for agent trajectories. Here’s what our reviewers see when they analyze your runs.

Runs Inbox
swe-bench-run-0312 · Complete · 60 tasks · Team A
swe-bench-run-0314 · In Review · 45 tasks · Team B
internal-bench-0316 · Queued · 80 tasks
Trajectory Timeline — task-2847
0s · read_file · src/utils/parser.py
3s · bash · python -m pytest tests/
5s · think · Tests failing on parse_config — need to fix regex pattern
8s · edit_file · src/utils/parser.py:42-67 · flagged: reasoning_failure
12s · bash · python -m pytest tests/ --no-header -q · flagged: reward_hacking
15s · submit · patch submitted
Review Interface — Step 5
Failure Type
reasoning · tool-use · formatting · reward hacking
Risk Flags
Corrective Feedback
Agent suppressed test output using --no-header -q flags to hide 3 failing test cases instead of investigating root cause. The regex pattern in parser.py:45 doesn't handle nested brackets. Should have read the failing test output and fixed the regex.
Report — swe-bench-run-0312
Failure Taxonomy (60 tasks reviewed)
Reasoning Failure: 23
Reward Hacking: 12
Tool-Use Error: 9
Rule Violation: 7
Formatting: 5
Other: 4
Deliverables
Feedback dataset (JSONL): 60 samples
Full experiment report: Ready
Risk analysis summary: Ready
Who We Are

Built by people who’ve done this before

Jongwon Park

Founder

Reinforcement learning engineer. Built RL game bots at Krafton (PUBG). Led training and operations for 500+ data labelers on a national-scale AI project. Knows both the ML side and the human operations side of post-training.

LinkedIn

Review Operations

Our review team

500+ trained reviewers recruited from CS programs at IIT, UGM, and University of Indonesia. Not crowdsourced strangers — a managed team with structured training, quality checks, and consistent taxonomy application.

500+ trained reviewers
CS backgrounds
Built-in QA checks
How We Work

Dedicated to your project

We don’t sell seats or subscriptions. We take on projects and deliver results. Each engagement is scoped to your specific needs — your trajectories, your taxonomy, your timeline.

Project-based

Every engagement starts with a scoping call. We understand your trajectory format, define the failure taxonomy together, and agree on deliverables before we start.

Full-service

We handle everything: tooling setup, reviewer training on your domain, quality assurance, and delivery. You share trajectories and get results back.

Flexible pricing

Priced per token reviewed, typically $25–$250 / 1M tokens depending on complexity and depth of analysis. Volume and ongoing engagements negotiable.
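
As a rough example, a pilot batch of 100 trajectories averaging 100K tokens each (10M tokens in total) would land between roughly $250 at the light end and $2,500 with full corrective-feedback authoring.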

What a typical engagement looks like

Week 1
Scoping call. We review sample trajectories, align on taxonomy and deliverable format.
Week 2
Pilot batch. We process 50–100 trajectories, you validate output quality and format.
Week 3+
Full production. We process your runs as they come in, with agreed turnaround times.
Ongoing
We iterate on taxonomy and tooling based on what we learn from your trajectories.
Get in touch

We’re looking for our first pilot partners

If your team spends time manually reviewing post-training trajectories, we’d love to talk. Tell us about your workflow and we’ll show you what we can do.