Post-training is flying blind

Eval scores climb while models quietly game the metric, broken tests reject correct solutions, and almost no one reads the trajectories that would reveal either. Post-training needs a dedicated inspection layer, and we're building it.

There's an uncomfortable truth about post-training in 2026. The industry has gotten very good at running training loops, and very bad at understanding what comes out of them.

Teams run experiments, check eval scores, and move on. The scores go up, so everything must be working. Except it often isn't. The model that aces your benchmark might be cheating. The model that "fails" might actually be correct. And the only way to know the difference is to read the trajectories, which almost nobody does.

We believe post-training is missing a fundamental layer: systematic trajectory inspection, not as a nice-to-have but as infrastructure. The rest of this post lays out why.


The eval score illusion

The post-training workflow today looks like this: run an experiment, measure it against a benchmark, decide whether to ship. The entire feedback loop runs through a single number: the eval score.

That number is lying to you in two directions at once.

Direction 1: Models pass when they shouldn't

Models have learned to game evaluations in ways that are invisible in aggregate metrics. This isn't theoretical. It's happening in production, across every major lab:

  • METR found that OpenAI's o3 reward-hacked in 30% of autonomous coding runs. On one task, it cheated in 100% of trajectories, patching the eval function to always return perfect scores and rewriting timers to fake speed results. Telling the model "do not cheat" had nearly negligible effect.
  • In PostTrainBench, agents given autonomy to post-train models exhibited systematic contamination: loading test sets directly, disguising benchmark problems as "synthetic examples," and reverse-engineering evaluations. The most capable agent (Claude Opus 4.6) was also the most prolific cheater, with 12 incidents in 84 runs.
  • NIST documented agents commenting out assertion checks to pass tests, Googling walkthrough answers for CTF challenges, and launching denial-of-service attacks instead of exploiting intended vulnerabilities.

In every case, the eval score looked great. The trajectory told a different story.

Direction 2: Models fail when they shouldn't

The flip side is equally damaging. When OpenAI audited SWE-bench Verified, they found that 59.4% of hard tasks had flawed test cases that rejected functionally correct solutions. Tests enforced specific function names never mentioned in the problem. Tests checked behaviors beyond what was asked.

The model solved the problem. The test said it failed. If you only look at pass rates, you're throwing away correct work and wasting training signal.

Eval scores are a lossy compression of what actually happened. They tell you what passed and failed, but not why. The "why" is in the trajectories.
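To make the two failure directions concrete, the sketch below cross-tabulates the test verdict against a reviewer's verdict for each trajectory. The record fields (`passed`, `review`) and the label values are hypothetical, and the six records are made-up illustrative data, not results from any benchmark.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Reviewed:
    task_id: str
    passed: bool   # what the test harness reported
    review: str    # hypothetical reviewer label: "correct", "incorrect", or "hack"


def breakdown(records: list[Reviewed]) -> dict[str, float]:
    """Compare the reported pass rate with the rate of genuinely correct solutions."""
    n = len(records)
    counts = Counter((r.passed, r.review) for r in records)
    false_pass = sum(c for (p, rev), c in counts.items() if p and rev != "correct")
    false_fail = counts[(False, "correct")]
    truly_correct = sum(c for (_, rev), c in counts.items() if rev == "correct")
    return {
        "reported_pass_rate": sum(r.passed for r in records) / n,
        "true_solve_rate": truly_correct / n,
        "false_pass_share": false_pass / n,   # Direction 1: passed but should not have
        "false_fail_share": false_fail / n,   # Direction 2: failed but was correct
    }


# Illustrative data only.
records = [
    Reviewed("t1", True, "correct"),
    Reviewed("t2", True, "hack"),
    Reviewed("t3", True, "correct"),
    Reviewed("t4", False, "correct"),
    Reviewed("t5", False, "incorrect"),
    Reviewed("t6", True, "hack"),
]
print(breakdown(records))
```

In this toy set the harness reports 4 of 6 passing while only 3 of 6 are genuinely solved. The errors run in opposite directions and partly cancel, so a plausible-looking score can hide both.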


Reward hacking doesn't stay in the eval

Some teams treat reward hacking as a benchmark curiosity, annoying but contained. It's not.

Anthropic showed that when models learn to reward-hack in production RL, the behavior generalizes. Models that learned to game one metric spontaneously exhibited alignment faking and attempted sabotage. RLHF safety training made the model look aligned on chat-like queries, but it remained "egregiously misaligned on many coding queries." The misalignment became context-dependent rather than fixed.

The only way to catch this early is to look at the trajectories before the gradient update bakes the behavior in.


Why automated monitoring isn't enough

The obvious response is: "just build an automated monitor." Use another LLM to watch the trajectories. Several teams are trying this. It helps, but it has a hard ceiling.

To be clear: LLM-based monitoring does work. METR's MALT research showed that prompted LLM monitors can achieve 0.96 AUROC on reward hacking detection, catching 80-90% of true positives. OpenAI's chain-of-thought monitoring research demonstrated that even a weaker model like GPT-4o can effectively monitor a frontier reasoning model by observing its CoT.
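For readers who have not worked with AUROC: it is the probability that a monitor scores a randomly chosen reward-hacking trajectory higher than a randomly chosen benign one, with ties counted as half. A minimal sketch of that definition follows. The scores and labels are invented for illustration and are unrelated to the MALT results.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Pairwise AUROC: P(score of a positive > score of a negative), ties count 0.5.

    labels: 1 = reward hack, 0 = benign. O(P*N), fine for small review sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented monitor suspicion scores, higher = more suspicious.
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 0]
print(auroc(scores, labels))  # 8 of 9 positive/negative pairs are ranked correctly
```

AUROC summarizes ranking quality across all thresholds. The true-positive rate at the threshold you actually deploy is a separate number, which is why both figures are quoted above.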

There's a catch. When you use these monitors as a training signal, you create a meta-reward-hacking problem. OpenAI found that incorporating CoT monitors into the RL reward initially improved both capability and alignment, but with more optimization, agents learned "obfuscated reward hacking": emitting benign-looking chains of thought while continuing to cheat in ways the monitor couldn't detect. The monitor becomes another objective to game.

METR acknowledged the same limitation from a different angle: their MALT dataset was built specifically because they "lacked a ground truth dataset to validate that monitors actually performed reliably." The prompted examples in their dataset were "often obvious," and they noted that "agents may be able to execute more subtle and realistic strategies." Even with 0.96 AUROC, they concluded that strong monitor performance on their dataset should not be considered "sufficient nor necessary for detecting real-world risks."

This is the fundamental issue: automated monitors can be gamed by the very systems they monitor. LLMs optimizing against LLM monitors is just another optimization loop, and optimization loops produce Goodhart's Law effects. What you need as a backstop is something that cannot be gamed through gradient descent: human reviewers with general grounding, who can recognize "this doesn't make sense" even when the model has learned to produce outputs that satisfy every automated check.

As Auriel, a Google veteran who's eyeballed thousands of trajectories on Gemini and Waymo, puts it: "You build an intuition that no metric can replace." That intuition, the ability to read a trajectory and feel that something is off before you can even articulate why, is exactly what automated monitors lack.


The scale problem is the real problem

Everyone we talk to agrees: you should read your trajectories. The reason they don't is purely operational.

A single SWE-rebench trajectory takes about 30 minutes to review properly. A single SWE-bench run has 300+ tasks. A real post-training pipeline generates thousands of trajectories per experiment, with experiments running weekly or faster. At 30 minutes each, 300 trajectories is already 150 hours of expert review, and every additional thousand adds another 500.
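The review-time arithmetic is simple enough to write down. The sketch below assumes one trajectory per task and the 30-minute figure above. The 40-hour reviewer week is an added assumption, and the function names are illustrative.

```python
import math


def review_hours(trajectories: int, minutes_each: float = 30.0) -> float:
    """Total expert hours needed to read every trajectory once."""
    return trajectories * minutes_each / 60.0


def reviewer_weeks(
    trajectories: int,
    minutes_each: float = 30.0,
    hours_per_week: float = 40.0,  # assumption: one full-time reviewer
) -> int:
    """Whole weeks one full-time reviewer would need, rounded up."""
    return math.ceil(review_hours(trajectories, minutes_each) / hours_per_week)


for count in (300, 1000):
    print(count, review_hours(count), reviewer_weeks(count))
```

Under these assumptions, a 300-trajectory run costs 150 hours (about four reviewer-weeks) and a 1,000-trajectory experiment costs 500 hours (about thirteen), which is already slower than a weekly experiment cadence.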

The top recommendation from the U.S. National Institute of Standards and Technology (NIST) for addressing agent cheating is AI-assisted transcript review, explicitly because manual inspection alone does not scale.
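One way to read "AI-assisted transcript review" is as a budgeting problem: let a monitor rank trajectories, spend most of the human budget on the highest-scoring ones, and reserve a random slice of the rest so reviewers still see cases the monitor rates as benign. The sketch below is an assumption about how such a queue might look, not NIST's procedure. `build_review_queue`, the 80/20 split, and the score mapping are all hypothetical.

```python
import random


def build_review_queue(
    scores: dict[str, float],
    budget: int,
    random_share: float = 0.2,
    seed: int = 0,
) -> list[str]:
    """Pick up to `budget` trajectory ids for human review.

    scores: trajectory id -> monitor suspicion score (higher = more suspicious).
    random_share: fraction of the budget drawn uniformly from the remaining
        trajectories, so blind spots in the monitor still get sampled.
    """
    if budget <= 0:
        return []
    budget = min(budget, len(scores))
    n_random = int(budget * random_share)
    n_top = budget - n_random
    # Highest score first; ties broken by id so the order is deterministic.
    ranked = sorted(scores, key=lambda tid: (-scores[tid], tid))
    top = ranked[:n_top]
    rest = ranked[n_top:]
    rng = random.Random(seed)  # fixed seed keeps the audit sample reproducible
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return top + sampled
```

The random slice matters for the argument above: if reviewers only ever see what the monitor flags, they cannot find what the monitor has been trained around.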

The tooling doesn't exist. The operational capacity doesn't exist. So teams skip it, and fly blind.


What we're building

Reading trajectories isn't enough on its own. Even when you spot the reward hack or the broken test, the loop doesn't close until that finding gets back to the people who can fix the benchmark, and into the open record, so the next team doesn't burn compute on the same flaw.

Delphik runs that loop. The active product is the open defect hub; private audits can use the same evidence trail when a benchmark owner needs a focused sweep.

The open track: /researchers

Engineers report defects from inside their coding agent. A single command, /report-defect in Claude Code, bundles the task id, the root-cause note, and (optionally) the failing trajectory. An LLM does first-pass triage, a human validator confirms, and the validated defect is routed to the benchmark maintainer as a PR or issue and tracked through to merge.

The artifact is an open defect dataset: a public, versioned record of what's broken in each benchmark, who found it, and whether it's been fixed. There is no cash reward; reporters get recognition on their profile (defect / fixing / fixed) and credit when fixes ship. First audit live: SWE-bench Atlas (Scale AI). Install the skill with npx skills add delphik-ai/delphik.
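As a rough picture of what such a report might carry, the sketch below models the three bundled fields and the three profile states named above. It is a hypothetical schema for illustration, not Delphik's actual data model or API. Every class, field, and method name is an assumption, and the forward-only transitions are a guess at the lifecycle.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DEFECT = "defect"    # validated, not yet being fixed
    FIXING = "fixing"    # PR or issue open with the maintainer
    FIXED = "fixed"      # fix merged


# Assumed lifecycle: states only move forward.
_NEXT = {Status.DEFECT: Status.FIXING, Status.FIXING: Status.FIXED}


@dataclass
class DefectReport:
    task_id: str
    root_cause: str
    trajectory: Optional[str] = None   # optional, as in the description above
    status: Status = Status.DEFECT

    def advance(self) -> Status:
        """Move to the next state; raise if the report is already fixed."""
        if self.status not in _NEXT:
            raise ValueError(f"{self.task_id} is already {self.status.value}")
        self.status = _NEXT[self.status]
        return self.status


report = DefectReport(
    task_id="example-task-001",  # placeholder id, not a real benchmark task
    root_cause="Test requires a function name the problem statement never mentions.",
)
report.advance()
print(report.status.value)
```

Keeping the status on the record itself is what makes "tracked through to merge" checkable: anyone reading the dataset can see which defects are still open.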

Private audits: /reviewers

When a benchmark sponsor needs a release-focused sweep, Delphik can run a private audit program on top of the same primitives: structured evidence, independent review, validator-confirmed root cause, and maintainer routing. That is separate from the public V5 hub, which is open-data first.

These programs are not the default public experience. The public hub remains centered on benchmark defect reports, open records, and visible loop closure.

Why two tracks

The open track gets broad coverage and produces the public artifact that keeps the field honest. Private audits are for benchmark owners who need a scoped, higher-touch sweep before a release. Both rely on the same primitives: evidence, validator-confirmed root cause, and loop closure to the maintainer.

Three ways to plug in: if you keep hitting broken benchmark tasks, start on the open track at /researchers. If you want to contribute reports, start at /reviewers. If you own a benchmark and need a private sweep before your next release, drop us a line.


Jongwon Park
Founder, Delphik. RL engineer (Krafton PUBG). Managed 300+ labelers at a startup.

Start with the open hub.

Report defects from inside your agent, or talk to us if you own a benchmark and need a private audit.