The Automation Trust Ladder: Why Klarna Rehired 700 Agents

Manual, assisted, supervised, autonomous: the four rungs of the automation trust ladder. Klarna skipped the middle ones, automated support end to end, and had to rehire the team.

In early 2024, Klarna announced it had replaced approximately 700 customer service agents with an AI assistant. The company promoted the move publicly, claiming the AI handled two-thirds of customer support chats and matched the productivity of its former human team. It looked like a clean automation win.

A year later, in May 2025, CEO Sebastian Siemiatkowski walked it back. “As cost unfortunately seems to have been a too predominant evaluation factor,” he said, “what you end up having is lower quality.” The AI couldn’t show empathy, couldn’t interpret emotional context, couldn’t handle the nuanced situations that were actually the hard part of the job. Klarna started hiring human agents again and called investing in the quality of human support “the way of the future”, repositioning human support as a trust differentiator rather than a cost center.

Klarna jumped straight to full autonomy across all of support, skipping the supervised phase in between. That phase is where it would have caught the empathy and nuance gaps before the AI was the only thing answering customers.

Why jumping to autonomy backfires

The appeal of full automation is obvious: set it up once, let it run, then stop thinking about it. But there’s a reason only 6% of companies fully trust AI systems to run core business processes without oversight (and it’s not that the other 94% are behind the curve). Klarna wasn’t an outlier in overreaching, either. An IBM survey of 2,000 CEOs found only one in four AI projects delivered the return on investment they promised, and just 16% had scaled across the enterprise.

Research on trust in automated systems consistently shows that trust is dynamic. It develops gradually through experience and observed performance, and it breaks much faster than it builds. A single early failure (especially a simple, visible one) can wipe out the credibility the system took weeks to establish. That asymmetry is why starting cautiously is not only about risk management. It is also how you end up with automation you actually keep using.

Deploying full automation before you have a track record means you’re extending trust based on a demo or a pilot, not on real performance in your specific context. When the first mistake happens (and it will), you have no baseline to compare against, no evidence that the system normally handles this case well, and no reason to keep the automation running rather than tearing it out.

The four rungs of the supervised AI automation framework

Think of automation adoption as a ladder with four rungs. You don’t have to stay at the bottom forever, but starting higher than you’ve earned is how you end up making the climb twice.

Rung 1: Fully manual. You do everything yourself. Every email, every decision, every action. This is the starting point for most people, and the right one, because it gives you a clear baseline for what good looks like before any AI gets involved.

Rung 2: AI-assisted. The AI drafts, summarizes, and suggests, but you execute every action. Nothing fires without your explicit instruction. This is where you learn what the AI does well in your specific context and what it gets wrong. It costs you nothing to be wrong here because nothing happens until you say so.

Rung 3: Supervised autonomy. The AI executes independently for decisions it handles consistently well, and pauses for your review on everything else. You review exceptions, not every action. This is where most of the time savings come from, and where the actual learning happens.

Rung 4: Fully autonomous. The AI handles specific, well-understood tasks without any human intervention. Not all tasks. The ones where it has earned that trust through a demonstrated track record on your actual data.

Rung 4 isn’t “the AI does everything.” It’s the AI doing specific things it has proven it can do, reliably, in your context. Klarna tried to jump from rung 1 to rung 4 across all of customer support at once. The rungs they skipped were where the system would have learned what it couldn’t handle.

How the rungs map to in-the-loop, on-the-loop, and out-of-the-loop

If the ladder feels familiar, it’s because it lines up with how the field already describes human oversight of AI. The usual split is three positions, and rungs 2 through 4 slot right into them.

Human-in-the-loop means the human sits inside the decision cycle. The agent proposes, but nothing executes until you approve, edit, or sign off. That’s rung 2 and the review side of rung 3, and it’s the right posture for actions that are hard to undo: payments, contract changes, anything sent to a client.

Human-on-the-loop means the agent runs its full cycle on its own while you supervise from above, stepping in only for exceptions and anomalies. That’s the autonomous side of rung 3, where the confident cases clear themselves and the odd one surfaces for review.

Human-out-of-the-loop means the agent acts with no real-time involvement, and oversight shrinks to checking the logs after the fact. That’s rung 4, and it only belongs on high-frequency, low-risk, reversible work where a wrong call is cheap to catch and cheaper to fix. The mistake Klarna made reads cleanly in this language: it moved frontline support straight to out-of-the-loop, where the cost of a bad answer was anything but cheap.
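The mapping above can be written down as a small lookup. This is only a sketch of the article's framing, not any product's API: the names `Rung`, `Oversight`, `oversight_for`, and `needs_approval` are made up for illustration, and `confident` stands in for whatever signal separates the cases a supervised workflow clears on its own from the ones it holds for review.

```python
from enum import Enum, IntEnum


class Rung(IntEnum):
    """The four rungs of the ladder (names are illustrative)."""
    MANUAL = 1
    AI_ASSISTED = 2
    SUPERVISED = 3
    AUTONOMOUS = 4


class Oversight(Enum):
    NO_AI = "human does the work"
    IN_THE_LOOP = "human approves before anything executes"
    ON_THE_LOOP = "human supervises and steps in on exceptions"
    OUT_OF_THE_LOOP = "human checks the logs after the fact"


def oversight_for(rung: Rung, confident: bool) -> Oversight:
    """Return the oversight posture for a single execution at a given rung."""
    if rung == Rung.MANUAL:
        return Oversight.NO_AI
    if rung == Rung.AI_ASSISTED:
        # Nothing fires without explicit approval, however confident the AI is.
        return Oversight.IN_THE_LOOP
    if rung == Rung.SUPERVISED:
        # Confident cases clear themselves; everything else waits for review.
        return Oversight.ON_THE_LOOP if confident else Oversight.IN_THE_LOOP
    return Oversight.OUT_OF_THE_LOOP


def needs_approval(rung: Rung, confident: bool) -> bool:
    """True when a human must sign off before the action executes."""
    return oversight_for(rung, confident) is Oversight.IN_THE_LOOP
```

Rung 3 is the only rung whose posture depends on the individual case, which is why it maps to both in-the-loop and on-the-loop.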

How to know when to advance

The natural question is what makes something ready to move up a rung. “It seems to be working” isn’t an answer you can act on when you’re deciding whether to remove human review from a step that sends emails to clients.

Confidence scoring answers this concretely. Every time a workflow step runs, it scores that specific execution: how clear was the input, how confident is the classification, how closely does this case resemble ones the system has handled correctly before? High-confidence executions accumulate a track record. Low-confidence ones surface for review.
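As a rough sketch of how per-execution scoring could route work, the snippet below combines the three questions above into one score. Everything here is an assumption for illustration: the `Signals` fields, the 0.85 cutoff, and the choice to take the weakest signal as the score are not taken from any specific product.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # assumed cutoff; tune against your own review data


@dataclass(frozen=True)
class Signals:
    """Hypothetical per-execution signals, each on a 0.0 to 1.0 scale."""
    input_clarity: float              # how clear was the input?
    classification_confidence: float  # how confident is the classification?
    similarity_to_known_good: float   # how close to cases handled correctly before?


def confidence(s: Signals) -> float:
    # Use the weakest signal, so one shaky dimension is enough to trigger review.
    return min(s.input_clarity, s.classification_confidence, s.similarity_to_known_good)


def route(s: Signals) -> str:
    """Return "auto" for high-confidence executions and "review" otherwise."""
    return "auto" if confidence(s) >= REVIEW_THRESHOLD else "review"
```

Taking the minimum rather than an average is a deliberately conservative choice: a clear, confidently classified input that looks nothing like past cases still goes to a human.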

After two or three weeks of running a workflow at supervised autonomy, patterns become visible. You might find, for example, that the AI classifies inbound leads correctly 97% of the time when the email contains a company name and a specific product question, and misclassifies about a third of the time when the email is vague. You can let the confident cases run automatically and keep the ambiguous ones in your manual queue. You’re not guessing anymore; you’re looking at actual performance data from your actual inputs.
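One way to turn a review log into that kind of decision is to compute accuracy per input segment and advance only the segments that clear a bar. The sketch below assumes a log of `(segment, was_correct)` pairs; the segment labels, the 95% accuracy bar, the 50-sample minimum, and the sample data (made up to mirror the figures above) are all illustrative.

```python
from collections import Counter

MIN_ACCURACY = 0.95  # assumed bar for letting a segment run automatically
MIN_SAMPLES = 50     # assumed minimum track record before trusting the rate


def segment_stats(reviews):
    """Map each segment to (accuracy, sample_count) from (segment, was_correct) pairs."""
    correct, total = Counter(), Counter()
    for segment, was_correct in reviews:
        total[segment] += 1
        correct[segment] += int(was_correct)
    return {seg: (correct[seg] / total[seg], total[seg]) for seg in total}


def segments_ready(reviews):
    """Return the segments with enough samples and high enough accuracy to advance."""
    return sorted(
        seg
        for seg, (accuracy, n) in segment_stats(reviews).items()
        if n >= MIN_SAMPLES and accuracy >= MIN_ACCURACY
    )


# Made-up log: 97 of 100 specific leads correct, 40 of 60 vague leads correct.
sample_reviews = (
    [("specific", True)] * 97 + [("specific", False)] * 3
    + [("vague", True)] * 40 + [("vague", False)] * 20
)
```

With this sample log, `segments_ready(sample_reviews)` returns `['specific']`, so the specific leads would move up a rung while the vague ones stay in the manual queue.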

Stitch Fix built a permanent version of this for outfit recommendations. Their engineering team runs daily human review of algorithmically generated outfits against a quality rubric, not because they don’t trust the algorithm, but because incorporating that feedback loop produced a 14% improvement in their internal quality measure and a measurable revenue lift. The human layer isn’t a temporary scaffold they’re planning to remove. It’s part of what makes the system work.

You may not need permanent human review for every workflow you build. But the principle holds: supervised operation is where you learn what the system actually does, not what the demo suggested it would do.

The queue that teaches itself

One concern people have about supervised automation is that the review queue never gets smaller: that you’re trading manual work for slightly different manual work. In practice, it goes the other way.

When you approve or reject a step, that feedback teaches the system for future runs. Cases that match patterns you’ve consistently approved start clearing automatically. Cases that resemble ones you’ve corrected stay in the queue longer. After a few weeks, you’re reviewing the genuinely hard calls, the ones that actually deserve human judgment, not re-litigating the same clear-cut cases you’ve already established patterns for.

A workflow that routes 40 items to your inbox in its first week might route 8 a few weeks later, not because it got smarter in some abstract sense but because it developed a track record on your specific decisions. The difference between AI agents and structured workflows matters here too: because the execution path is defined and each step is discrete, the system knows exactly which step produced which outcome and can apply that learning precisely where it’s relevant.
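A toy model of that feedback loop: keep a streak of consecutive approvals per workflow step and input pattern, let a pattern clear automatically once the streak is long enough, and reset it on any rejection. This is a sketch under stated assumptions, not how any particular product implements it; `ReviewQueue`, the `(step, pattern)` key, and the five-approval cutoff are hypothetical.

```python
from collections import defaultdict

AUTO_CLEAR_AFTER = 5  # assumed: consecutive approvals before a pattern clears itself


class ReviewQueue:
    """Tracks approval streaks per (step, pattern) pair."""

    def __init__(self):
        # (step, pattern) -> current run of consecutive approvals
        self.streaks = defaultdict(int)

    def needs_review(self, step: str, pattern: str) -> bool:
        """A case stays in the queue until its pattern has earned a long enough streak."""
        return self.streaks.get((step, pattern), 0) < AUTO_CLEAR_AFTER

    def record(self, step: str, pattern: str, approved: bool) -> None:
        """Record a human decision; a single rejection resets the streak to zero."""
        key = (step, pattern)
        self.streaks[key] = self.streaks[key] + 1 if approved else 0
```

Keying the streak by step as well as pattern reflects the point about discrete steps: a correction on one step does not reset the record of another.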

Where to start

If you’re currently doing everything manually because you don’t trust AI automation, or you tried something fully autonomous and it didn’t hold up, the supervised rung is the right entry point.

Pick one workflow. Run it at supervised autonomy for two weeks. Review every exception it flags, and audit the actions it cleared on its own. Pay attention to which ones are consistently right and which ones surprise you. At the end of week two, you’ll have a concrete picture of what’s ready to advance and what needs more time. You’ll also have something Klarna didn’t have before it made its announcement: evidence.

If you’re not sure which workflow to start with, client follow-up automation is a good first case. The inputs are predictable, the output is a single email draft, and the approval step is natural. Most people see their review queue shrink noticeably within three weeks. That track record is what earns the next rung.

Approvals are always free on Rills. You only pay when the AI takes an action that creates real value (a sent email, a CRM update, an API call). Every review step, every confidence check, every approval in your queue costs nothing. Build your first supervised workflow and start collecting the track record that earns autonomy.