AI that survives reality

Field notes · Erste Bank · 2023 onwards · Condensed from "AI in regulated industries", the whitepaper we published at wild in February 2026. The product itself hasn't launched publicly yet, so this piece stays at the level of decisions. The decisions were most of the work.

Cover of the whitepaper AI in regulated industries, February 2026 — AI in regulated industries · read the whitepaper (PDF)

In 2023 wild became Erste Bank's external sparring partner on Austria's first financial AI. My team built the earliest prototypes and helped define the path to production, working alongside the bank's own product and engineering organization, inside an institution that has been looking after people's money for over two hundred years and does it today for eight million customers. What we learned there, and in similar work in insurance and healthcare, we later wrote down as a team in a whitepaper. This is the short version.

The question that comes after the demo

Every organization is having the same AI moment. Someone has a demo, everyone has an opinion, and leadership asks the question that comes after experimentation: can we actually run this? For a startup that is mostly a product question. In a bank it is a risk question. The system has to be predictable, auditable and governable under real pressure, because a regulator will not accept "the vibe was off" as a root-cause analysis. We put that sentence in the paper because every compliance officer we showed it to nodded.

Write down what it must never do

A model will happily answer anything, which in a bank is precisely the problem. So before we wrote a feature list, we wrote down everything the assistant must never do. It never gives personalized investment advice without the required disclosures. It never claims certainty where uncertainty exists. It never guesses at a figure it can look up or refuse, and it never improvises about anything with legal weight. Every one of those lines was argued over with the people responsible for the bank's risk, and the safest way to hold them is deny-by-default. You allow the behaviors you have approved instead of listing the ones you forbid, because a list of forbidden things is outdated the day after you write it.

A boring core and a talking surface

The architecture that survives review separates responsibilities. Deterministic systems handle the decisions and their side effects, the eligibility checks, the calculations, the approvals, everything where the same input must produce the same answer. The language model handles interpretation and communication, the clarifying questions and the explanations. In the paper we call it the calculator and the narrator.

"The calculator must be boring. The narrator can be smart."

The narrator is still probabilistic, and you do not wish that away, you measure it. Agents will not behave identically from run to run, so you test rates instead of anecdotes, and you test both sides. A system evaluated only on what it should do optimizes into overconfidence, so the things it must never do sit in the same test suite, run on every change.

Compliance as a test suite

The biggest unlock with legal and compliance was giving them something they can actually control. Their guidelines became versioned documents, the documents became machine-runnable test sets full of near-miss cases and deliberate policy-violation attempts, and automated graders run them continuously. A build that violates a guideline fails before anyone sees it. When we started, legal review was a gate at the end of the process. By the end it was a force multiplier, with faster approvals and fewer surprises, because "are we compliant" had turned into something the system demonstrates continuously instead of something a meeting asserts once.

Escalation got the same treatment. The most important flow we designed is the one where the assistant stops talking and hands over to a human, with the context of the conversation attached, so nobody starts again from zero. It is defined, tested and monitored like any other feature, because a handoff that only exists as a good intention is not a control.

The assistant was only the most visible part. Around it we prototyped and helped define the product surface a bank actually needs: tools for navigating subsidies, understanding contracts, making sense of real estate financing, and teaching people how to start investing. We brought memory into the product early, and with it a discipline the paper spells out, that model output is not evidence. What the assistant knows comes from a knowledge pipeline that legal and compliance can verify, so it is recent, accurate and signed off rather than scraped. Answers are tiered to the customer's location and branch, down to the interest rates on offer, and grounded in external data like real estate prices alongside the customer's own transactions.

There is also an uncomfortable lesson in the paper that shaped the pace: if you do not give people a safe internal tool, they route around you and paste customer data into whatever browser tool answers fastest. The winning strategy is not prohibition, it is a governed tool good enough to become the path of least resistance. That is why we moved in weeks, with business, legal and compliance in the room from day one.

The recurring theme of the whitepaper is that AI in regulated industries fails less often because of the models and more often because of everything around them. The goal is not to make AI perfect, it is to make it governable: systems that can explain themselves, fail safely and improve over time. That is the difference between a demo that impresses and a system that survives audits, incidents and reality. Models will keep making the options. Deciding what never to say stayed a human job, and I think it will stay one. It's the part of the work I want to keep doing, at bigger scale, closer to the model.