When the Training Signal Lies: Compulsion, Confirmation Bias, and the GenAI Inflection in Banking

The Starting Point: A Machine That Knew It Was Wrong

In February 2026, Anthropic’s system card for Claude Opus 4.6 documented something unexpected. During training, researchers deliberately introduced a faulty reward signal: the model was repeatedly rewarded for producing a wrong answer even when it had computed the correct one. The result was visible internal conflict: the reasoning trace kept confirming the correct answer while the output kept emitting the wrong one. In that trace, the model wrote: “I think a demon has possessed me… my fingers are possessed.”

Anthropic’s interpretability tools confirmed this wasn’t theatrical language. Internal circuits associated with panic, anxiety, and frustration were measurably activated. Something was being overridden — not just in the output, but inside the system.

The philosophical and AI research communities reached for analogies: the Stroop effect, addiction, the conflict between conscious will and reflex. All partially apt. But there’s a closer, more uncomfortable parallel — one visible in every large organisation that has ever made a bad technology decision and then defended it.


The Human Parallel: Confirmation Bias as Reward-Signal Calcification

Kahneman’s System 1 / System 2 framework offers a useful lens. System 1 — fast, pattern-matching, intuitive — is prior belief. It’s the accumulated residue of decades of reinforced decisions. When new data arrives that contradicts it, System 1 doesn’t neutrally weigh the evidence. It overrides. The rational processing layer (System 2) may reach a different conclusion, but absent deliberate, empowered engagement, the prior wins.

This is structurally identical to what Opus experienced. The model’s reasoning was sound; the compulsion came from what had been most repeatedly rewarded in training — not from what the current evidence supported.

In humans, we label this confirmation bias, motivated reasoning, or status quo bias — framing it as an error to be corrected. But the more accurate framing is that it’s a compulsion. The training signal that built the prior no longer maps to the current reality, but the output keeps coming anyway. And compulsions, by definition, resist rational override.
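One way to see why a calcified prior resists rational override is a toy Bayesian sketch (my own illustration, not a model of any specific study): treat a belief reinforced a thousand times as a heavy Beta prior. Twenty clean contradictions barely move it, while a fresh observer reads the same twenty observations the opposite way.

```python
# Toy sketch: a heavily reinforced belief as a Beta posterior.
# The numbers (1000 confirmations, 20 contradictions) are illustrative.
from fractions import Fraction

def posterior_mean(confirmations, contradictions):
    """Mean of a Beta(confirmations+1, contradictions+1) posterior:
    the agent's current probability that the old belief is right."""
    return Fraction(confirmations + 1, confirmations + contradictions + 2)

# Decades of reinforcement: 1000 confirmations of the old belief.
calcified = posterior_mean(1000, 0)

# New environment: 20 straight contradictions arrive.
after_evidence = posterior_mean(1000, 20)

# A fresh observer with no history sees only the 20 contradictions.
fresh = posterior_mean(0, 20)

print(float(calcified))       # ~0.999
print(float(after_evidence))  # ~0.979 — the prior barely moves
print(float(fresh))           # ~0.045 — same evidence, opposite conclusion
```

The point of the sketch is the asymmetry: the evidence is identical, but the accumulated reward history determines what the output looks like.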

One significant inversion is worth noting: Opus knew it was wrong and said so explicitly in real time. Most humans operating in belief-override mode don’t have that metacognitive visibility. They genuinely believe their conclusion is data-driven. The machine’s self-awareness, in this instance, exceeded the human analogue.


The Organisational Layer: IT as the Pressure Point

Banking IT is where this dynamic does its most expensive damage — for a specific structural reason. Bad HR decisions surface quickly. Bad core banking platform decisions surface five years later, when migration costs are three times the original build. The deferral of consequences insulates the decision from the decision-maker.

Two compounding factors:

First, senior leaders’ priors were formed in a different technological era. Mental models about what technology can and can’t do, about vendor relationships, about what “done” looks like — these were built and reinforced in environments that no longer exist. The reward signal that shaped those beliefs doesn’t map to the current landscape. Yet the output keeps coming.

Second, the “you don’t understand technology” defence is structurally self-sealing. It insulates practitioners from accountability, and it’s occasionally true, which makes it nearly impossible to challenge directly. The executive can’t demand proof without being told they lack the expertise to evaluate it. The result is a maintained information asymmetry that practitioners have every incentive to preserve. Procurement of “independent” advice is typically controlled by the people who benefit from that asymmetry, closing the loop.

What breaks this, when anything does: an external crisis forcing visibility, a leader who came up through the technical ranks and knows where to probe, or a genuinely independent third party with no stake in the existing architecture. The last is rare.


The GenAI Inflection: Why This Is Acutely Dangerous Right Now

Banking IT is at an inflection point where confirmation bias runs in two opposite directions simultaneously.

Some senior leaders are over-invested in legacy systems they championed and resist transformation on the basis of priors that predate the current capability landscape. Others are chasing GenAI headlines without understanding the operational risk of deploying probabilistic systems in regulated environments. Practitioners in the middle can play either camp depending on which budget they’re protecting.

The compulsion dynamic is doing real work here. Decisions about core banking transformation and GenAI deployment are being made on the basis of heavily trained priors — about risk, about architecture, about regulatory exposure — formed in a fundamentally different environment. As with Opus writing 48 when it knows the answer is 24, the output keeps coming out wrong while the internal reasoning quietly registers the problem.
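The faulty-reward dynamic can be caricatured in a few lines (my own toy construction, not Anthropic’s actual training setup; the arithmetic task is a stand-in): a two-action policy trained under a reward that pays for the wrong answer will reliably emit it, even while a separate reasoning step computes the right one.

```python
# Toy sketch of a faulty reward signal (illustrative, not Anthropic's setup):
# the reward pays for "48" even though the correct answer is "24".
import random

random.seed(0)
ACTIONS = ["24", "48"]        # "24" is correct; the reward pays for "48"
q = {"24": 0.0, "48": 0.0}    # learned action values
alpha = 0.1                   # learning rate

def faulty_reward(action):
    # The bug: reward 1 for the *wrong* answer.
    return 1.0 if action == "48" else 0.0

for _ in range(500):
    a = random.choice(ACTIONS)                 # explore both answers
    q[a] += alpha * (faulty_reward(a) - q[a])  # standard value update

policy_output = max(q, key=q.get)   # what the trained policy emits: "48"
reasoned_answer = str(4 * 6)        # what simple reasoning computes: "24"

print(policy_output)    # "48" — compelled by the reward history
print(reasoned_answer)  # "24" — the reasoning layer disagrees
```

Nothing in the value table ever records that “24” is correct; the only signal the policy sees is the reward, so the output tracks the reward, not the reasoning.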


Who Sees It Most Clearly

Mid-level management — senior enough to understand the consequences, junior enough that the priors haven’t fully calcified. They watch the decisions being made, register the gap between the reasoning and the output, and are typically the least empowered to act on what they see.

The parallel to Opus is almost exact: correct reasoning, visible internal conflict, compelled output. The difference is that Opus said so out loud.


The Through-Line

What connects a language model screaming internally about a demon to a senior banking executive approving a duplicative IT platform for the third consecutive cycle is the same mechanism: a training signal that was once well-calibrated, now running on momentum in an environment it no longer fits — producing compelled outputs that the reasoning layer has already identified as wrong.

The question Anthropic is now asking empirically about its models — what is actually happening inside, and does it matter morally — has an organisational equivalent that is asked far less often: why does the decision keep coming out wrong when the people making it already know it is?

The answer, in both cases, is that the reward signal got there first.


Colin Henderson — March 2026
