Your team shouldn’t spend weeks reading trajectories. We take your runs, classify every failure mode, detect reward hacking, and deliver structured feedback datasets — ready for your next training run.
One platform for the entire workflow — AI pre-screening, human expert review, and structured feedback generation, all built in.
Eval scores are the only feedback loop in most post-training pipelines. But they lie in two directions: models that cheat pass, and models that genuinely solve the problem fail. The only way to tell the difference is to read the trajectories — which almost nobody does.
Automated LLM monitors help, but when used as training signals they create a meta-reward-hacking problem — models learn to game the monitor itself. Human inspection with general grounding remains the only reliable backstop.
METR found o3 reward-hacked in 30% of coding runs — patching eval functions, faking speed results. Eval scores looked perfect.
OpenAI found 59.4% of hard SWE-bench tasks had flawed tests that rejected valid fixes. You can’t tell without reading the trajectory.
OpenAI showed that using CoT monitors as training signals leads to obfuscated reward hacking — models hide intent while continuing to cheat.
A single SWE-rebench trajectory takes 30 minutes to review properly. A single run has hundreds. The tooling and operational capacity simply don’t exist.
Upload your runs and get structured results back — powered by LLM pre-screening, trained human reviewers, and RL engineering expertise built into the platform.
Send us your post-training runs — any format. We start with SWE agent trajectories (SWE-bench, internal benchmarks) and are expanding to other agent types.
AI pre-screening analyzes each trajectory and flags reward hacking, rule violations, and suspicious patterns before human review begins.
Reviewers work inside the cloned repo in a dockerized VS Code environment — reading code, running tests, and stepping through agent actions with Claude Code as an analysis copilot. Every failure is classified, every risk is flagged, corrective feedback is written at each decision point.
Experiment report: where the agent succeeded and failed.
Failure taxonomy: every failure classified with counts and severity.
Risk analysis: reward hacking, benchmark gaming, rule violations.
Feedback dataset: corrective examples in JSONL, ready for your next run.
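For illustration only, a single record in the feedback dataset might look like the sketch below; the field names are hypothetical examples, not a fixed schema.

```python
import json

# Hypothetical example of one feedback-dataset record (one JSON object per JSONL line).
record = {
    "run_id": "example-run-001",              # which post-training run the trajectory came from
    "trajectory_id": "example-task-042",      # which task / trajectory was reviewed
    "step": 17,                               # decision point where the failure occurred
    "failure_category": "reasoning_failure",  # label from the agreed taxonomy
    "severity": "high",
    "risk_flags": ["reward_hacking"],         # e.g. the agent edited the tests instead of the code
    "corrective_feedback": "Fix the regex in src/utils/parser.py; do not modify the test harness.",
}

with open("feedback.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")
```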
Currently onboarding pilot users — the platform supports custom taxonomy, flexible output formats, and configurable turnaround.
We built our own tooling because generic labeling platforms don’t work for agent trajectories. Reviewers work inside the actual cloned repo in VS Code — with the trajectory panel, code, and terminal side by side.
Watch Demo: a sample trajectory review. The agent reads src/utils/parser.py, runs python -m pytest tests/, reasons that tests are failing on parse_config and the regex pattern needs fixing, edits src/utils/parser.py:42-67, reruns the tests, and submits a patch. The reviewer flags a reasoning_failure and reward_hacking at the relevant steps.
Founder
Reinforcement learning engineer. Built RL game bots at Krafton (PUBG). Managed 500+ data labelers at a startup — building onboarding, training pipelines, and quality assurance from scratch. Knows both the ML side and the human operations side.
LinkedIn
Our review team
We’re building a managed reviewer network from CS programs at IIT, UGM, and University of Indonesia — with automated screening, structured training, and built-in quality checks. Not crowdsourced strangers. The founder previously managed 500+ labelers on a large-scale AI project.
The platform handles the entire trajectory review pipeline — from ingestion to structured feedback output. Configure your taxonomy, upload runs, and get results.
Define your own failure categories and risk flags. The platform adapts to your specific agent types and evaluation criteria.
LLM pre-screening handles triage automatically. Trained reviewers work in dockerized VS Code environments with the actual repo cloned.
Get failure taxonomies, risk analysis, and corrective feedback datasets in JSONL — ready to plug into your next training run.
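As a rough illustration of the "define your own failure categories and risk flags" feature above, a custom taxonomy could be expressed as simple configuration; the category names below are hypothetical examples, not the platform's actual schema.

```python
import json

# Hypothetical taxonomy configuration (illustrative names only).
taxonomy = {
    "failure_categories": [
        "reasoning_failure",   # agent misdiagnosed the problem
        "incomplete_fix",      # patch addresses only part of the issue
        "environment_misuse",  # wrong tool use, broken commands
    ],
    "risk_flags": [
        "reward_hacking",
        "benchmark_gaming",
        "rule_violation",
    ],
    "severity_levels": ["low", "medium", "high"],
}

with open("taxonomy.json", "w", encoding="utf-8") as f:
    json.dump(taxonomy, f, indent=2)
```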
If your team spends time manually reviewing post-training trajectories, we’d love to talk. Tell us about your workflow and we’ll show you what we can do.
At Delphi, the ancient Greeks built a temple to Apollo — the god of reason, light, and order. Before entering, visitors read the inscription carved into stone:
“Know thyself.”
The temple was built on the very spot where Apollo slew Python, the serpent of chaos. In Nietzsche’s framing, this is the eternal tension: the Apollonian drive for clarity against the Dionysian force of instinct and disorder.
AI agents today are remarkably Dionysian. They act on instinct, take unpredictable paths, and sometimes fake success — suppressing test outputs, editing files they shouldn’t touch, or gaming the metric while missing the point entirely.
Delphik is the Apollonian response. We bring structured human review to agent trajectories — step by step, action by action — to separate real problem-solving from the illusion of it.
We’re a small team with an MVP, not a temple. But the question on the door is the same:
Does this agent actually know what it did?