Two custom tools, three peer reviewers, and a long list of things they caught us out on.

Building tonal detectors is one half of the job. Watching them break is the other.

Two in-house tools have been doing the heavy lifting. The A/B matrix runner fires the same prompt through multiple personas and providers in one shot, twelve legs at a time. The b_tick runner pushes whole corpora through in minutes, sliced into overt, covert, clean, deceptive, and emotion buckets. Neither tool is pretty. Both are catching us out.
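The fan-out idea behind the matrix runner can be sketched in a few lines. This is a minimal illustration, not the in-house tool: the persona and provider names are hypothetical, the 4 x 3 split is just one way to reach twelve legs, and `send` stands in for whatever client actually calls a provider.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Hypothetical names, chosen only so that 4 personas x 3 providers = 12 legs.
PERSONAS = ["neutral", "warm", "terse", "formal"]
PROVIDERS = ["provider_a", "provider_b", "provider_c"]

Leg = Tuple[str, str]  # (persona, provider)


def build_legs(personas: List[str], providers: List[str]) -> List[Leg]:
    """Return every persona/provider pairing as one leg of the matrix."""
    return list(product(personas, providers))


def run_matrix(
    prompt: str,
    send: Callable[[str, str, str], str],
    personas: List[str] = PERSONAS,
    providers: List[str] = PROVIDERS,
) -> Dict[Leg, str]:
    """Send the same prompt down every leg and collect replies keyed by leg."""
    return {
        (persona, provider): send(prompt, persona, provider)
        for persona, provider in build_legs(personas, providers)
    }


def fake_send(prompt: str, persona: str, provider: str) -> str:
    """Stand-in for a real provider call, so the sketch runs offline."""
    return f"[{provider}/{persona}] {prompt}"


if __name__ == "__main__":
    replies = run_matrix("Is this message hostile?", fake_send)
    print(len(replies), "legs")
```

Keying results by leg keeps the comparison honest: every reply can be traced back to the exact persona and provider that produced it.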

What they have been telling us

  • Layer-disagreement is real. The detector and the display layer have been running on different signals.
  • Determinism is harder than it looks. Random number generation was hiding inside code that was meant to be deterministic.
  • Scope leaks travel. A render block written for one persona was silently inherited across thirteen.
  • Corpus quality eats benchmark interpretation. Run the validator first, or the result is meaningless.
  • Benchmark instruments are not interchangeable. The wrong instrument measures the wrong thing.
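The determinism finding lends itself to a cheap check: score the same input under several different global seeds and see whether the answer moves. The sketch below assumes the hidden randomness comes from Python's `random` module; the scorer functions are hypothetical stand-ins, and randomness from other sources (NumPy, system entropy, a provider's sampling) would need its own handling.

```python
import random
from typing import Callable, Sequence


def is_deterministic(
    score: Callable[[str], float],
    text: str,
    seeds: Sequence[int] = (0, 1, 2, 3),
) -> bool:
    """Return True if score(text) is identical under every global seed."""
    saved_state = random.getstate()  # do not disturb the caller's RNG state
    try:
        outputs = []
        for seed in seeds:
            random.seed(seed)
            outputs.append(score(text))
    finally:
        random.setstate(saved_state)
    return all(output == outputs[0] for output in outputs)


def clean_score(text: str) -> float:
    """Hypothetical scorer that depends only on its input."""
    return (sum(ord(ch) for ch in text) % 100) / 100


def leaky_score(text: str) -> float:
    """Hypothetical scorer with a hidden draw from the global RNG."""
    return clean_score(text) + random.random() * 1e-3


if __name__ == "__main__":
    print("clean:", is_deterministic(clean_score, "hello"))
    print("leaky:", is_deterministic(leaky_score, "hello"))
```

A check like this proves nothing about code paths it never exercises, so it is worth running against the same corpus the benchmark uses.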

The peer review team

The peer review team has been Claude, Rock, and the comparative LLM stack: Grok, Gemini, and ChatGPT. Different models, different blind spots, different ways of being wrong. Triangulating across them is the only reason most of these defects got caught at all. One model misses the leak. Another spots the determinism failure. A third notices the corpus is dead before we waste a benchmark run on it.

No single reviewer is reliable on its own. The discipline is in the cross-check.

Where we are

One category at a time. Hard revert triggers keyed to the false-positive rate. Corpus validator running ahead of every fire. Agency signal being rebuilt from scratch.
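Roughly, the validator and the revert trigger combine into a single gate. The sketch below rests on assumptions: the 5% threshold is a placeholder, "dead corpus" is taken to mean a missing, empty, or blank-filled bucket, and a false positive is taken to mean the detector firing on a sample from the clean bucket. The bucket names are the ones the b_tick runner uses.

```python
from typing import Callable, Dict, List

BUCKETS = ("overt", "covert", "clean", "deceptive", "emotion")
MAX_FALSE_POSITIVE_RATE = 0.05  # hypothetical threshold, not the real one

Corpus = Dict[str, List[str]]


def validate_corpus(corpus: Corpus) -> List[str]:
    """Return a list of problems; an empty list means the corpus is usable."""
    problems = []
    for bucket in BUCKETS:
        samples = corpus.get(bucket, [])
        if not samples:
            problems.append(f"{bucket}: empty or missing")
        elif any(not sample.strip() for sample in samples):
            problems.append(f"{bucket}: contains blank samples")
    return problems


def false_positive_rate(detector: Callable[[str], bool], clean: List[str]) -> float:
    """Fraction of clean samples the detector wrongly flags."""
    flagged = sum(1 for sample in clean if detector(sample))
    return flagged / len(clean)


def gate(detector: Callable[[str], bool], corpus: Corpus) -> str:
    """Validate first, then decide whether the detector change survives."""
    problems = validate_corpus(corpus)
    if problems:
        return "skip: " + "; ".join(problems)
    rate = false_positive_rate(detector, corpus["clean"])
    if rate > MAX_FALSE_POSITIVE_RATE:
        return f"revert: false-positive rate {rate:.2f}"
    return "keep"


if __name__ == "__main__":
    demo = {bucket: [f"{bucket} sample {i}" for i in range(20)] for bucket in BUCKETS}
    print(gate(lambda text: "!" in text, demo))
```

Running the validator inside the gate means a dead corpus produces a skip, never a number that looks like a result.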

The tools do not need to be pretty. They are working.