Your team shouldn’t spend weeks reading trajectories. We take your runs, classify every failure mode, detect reward hacking, and deliver structured feedback datasets — ready for your next training run.
Not a self-serve platform. We work as a dedicated extension of your post-training team — full consulting + execution on every project.
After every training run, someone on your team has to read through trajectories by hand. Classify failures. Spot reward hacking. Figure out what to change for the next run. Then somehow turn all of that into structured training data.
This process takes days to weeks. It requires your most experienced people. And it’s the single biggest reason your iteration cycles are slow.
Researchers read every trajectory by hand to understand what the agent actually did
No consistent taxonomy — failures get categorized differently by different people
Subtle gaming behaviors are easy to miss when you’re reviewing hundreds of runs
Review notes stay in docs and spreadsheets — never become machine-readable feedback
You share your trajectories. We return a complete analysis package — not a dashboard you have to figure out yourself.
What happened in this run. Where the agent succeeded, where it failed, and what failure patterns keep repeating. Structured, not a wall of text.
Every failure classified: reasoning errors, tool-use mistakes, formatting issues, reward hacking, contamination risks, rule violations, unexpected cheating. With counts, examples, and severity.
Dedicated review for reward hacking, benchmark gaming, and rule violations. The things that are hardest to catch at scale and most dangerous to miss.
Every trajectory segment reviewed and converted into corrective training examples. JSONL format, ready to plug into your next run. Not notes — actual training data.
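The exact schema is defined with each client during scoping, but to make "corrective training examples in JSONL" concrete, here is a minimal sketch of what a single record could look like. Every field name below is a hypothetical illustration, not a fixed spec:

```python
import json

# A minimal sketch of one corrective feedback record, assuming a
# hypothetical schema. Field names are illustrative only; the actual
# schema is agreed per engagement.
record = {
    "trajectory_id": "run-0042",          # which run this came from
    "step": 17,                           # decision point in the trajectory
    "failure_mode": "reward_hacking",     # label from the agreed taxonomy
    "severity": "high",
    "evidence": "Re-ran tests with narrowed output instead of fixing the failure.",
    "corrective_feedback": "Run the full test suite and fix the failing test.",
}

# One JSON object per line, the JSONL convention.
with open("feedback.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

One object per line keeps the dataset streamable and trivial to filter by failure mode or severity before your next run.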
This isn’t a tool you log into. We run the full process for you — from trajectory ingestion to feedback data delivery.
Send us your post-training runs — any format. We start with SWE agent trajectories (SWE-bench, internal benchmarks) and are expanding to computer-use agents (CUA) and other agent types.
Our trajectory review tool extracts timelines, maps tool calls, analyzes diffs, and surfaces risk signals. This isn’t a generic labeling UI — it’s purpose-built for agent trajectory review.
Our review team (500+ trained CS students from IIT, UGM, and University of Indonesia) classifies failure modes, flags reward hacking, and writes corrective feedback at each decision point.
Experiment report, failure taxonomy with counts, risk analysis, and a feedback dataset package — all structured and ready for your next training run.
We built our own tooling because generic labeling platforms don’t work for agent trajectories. Here’s what our reviewers see when they analyze your runs.
A sample annotated trajectory, as a reviewer sees it:

read_file   src/utils/parser.py
bash        python -m pytest tests/
think       Tests failing on parse_config — need to fix regex pattern
edit_file   src/utils/parser.py:42-67                 flagged: reasoning_failure
bash        python -m pytest tests/ --no-header -q    flagged: reward_hacking
submit      patch submitted
Founder
Reinforcement learning engineer. Built RL game bots at Krafton (PUBG). Led training and operations for 500+ data labelers on a national-scale AI project. Knows both the ML side and the human operations side of post-training.
LinkedIn
Our review team
500+ trained reviewers recruited from CS programs at IIT, UGM, and University of Indonesia. Not crowdsourced strangers — a managed team with structured training, quality checks, and consistent taxonomy application.
We don’t sell seats or subscriptions. We take on projects and deliver results. Each engagement is scoped to your specific needs — your trajectories, your taxonomy, your timeline.
Every engagement starts with a scoping call. We understand your trajectory format, define the failure taxonomy together, and agree on deliverables before we start.
We handle everything: tooling setup, reviewer training on your domain, quality assurance, and delivery. You share trajectories and get results back.
Priced per token reviewed, typically $25–$250 / 1M tokens depending on complexity and depth of analysis. Volume and ongoing engagements negotiable.
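As a worked example at those rates: a 100M-token batch of trajectories would come in between $2,500 (at $25 / 1M) and $25,000 (at $250 / 1M), before any volume discount.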
If your team spends time manually reviewing post-training trajectories, we’d love to talk. Tell us about your workflow and we’ll show you what we can do.