In early 2024, Klarna announced it had replaced approximately 700 customer service agents with an AI assistant. The company promoted the move publicly, claiming the AI handled two-thirds of customer support chats and matched the productivity of its former human team. It looked like a clean automation win.
A year later, CEO Sebastian Siemiatkowski walked it back. "As cost unfortunately seems to have been a too predominant evaluation factor," he said, "what you end up having is lower quality." The AI couldn't show empathy, couldn't interpret emotional context, couldn't handle the nuanced situations that were actually the hard part of the job. Klarna shifted back to a hybrid model, repositioning human support as a trust differentiator rather than a cost center.
Klarna didn't get burned by automation. It got burned by going straight to full autonomy without a supervised phase where the system could have learned what it couldn't handle.
Why jumping to autonomy backfires
The appeal of full automation is obvious: set it up once, let it run, then stop thinking about it. But there's a reason only 6% of companies fully trust AI systems to run core business processes without oversight (and it's not that the other 94% are behind the curve).
Research on trust in automated systems consistently shows that trust is dynamic. It develops gradually through experience and observed performance, and it breaks much faster than it builds. A single early failure (especially a simple, visible one) can wipe out the credibility the system took weeks to establish. That asymmetry is why starting cautiously isn't just risk management; it's how you end up with automation you actually keep using.
Deploying full automation before you have a track record means you're extending trust based on a demo or a pilot, not on real performance in your specific context. When the first mistake happens (and it will), you have no baseline to compare against, no evidence that the system normally handles this case well, and no reason to keep the automation running rather than tearing it out.
The four rungs of the supervised AI automation framework
Think of automation adoption as a ladder with four rungs. You don't have to start at the bottom forever, but starting higher than you've earned is how you end up making the climb twice.
Rung 1: Fully manual. You do everything yourself. Every email, every decision, every action. This is the starting point for most people, and the right one, because it gives you a clear baseline for what good looks like before any AI gets involved.
Rung 2: AI-assisted. The AI drafts, summarizes, and suggests, but you execute every action. Nothing fires without your explicit instruction. This is where you learn what the AI does well in your specific context and what it gets wrong. It costs you nothing to be wrong here because nothing happens until you say so.
Rung 3: Supervised autonomy. The AI executes independently for decisions it handles consistently well, and pauses for your review on everything else. You review exceptions, not every action. This is where most of the time savings come from, and where the actual learning happens.
Rung 4: Fully autonomous. The AI handles specific, well-understood tasks without any human intervention. Not all tasks. The ones where it has earned that trust through a demonstrated track record on your actual data.
Rung 4 isn't "the AI does everything." It's the AI doing specific things it has proven it can do, reliably, in your context. Klarna tried to jump from rung 1 to rung 4 across all of customer support at once. The rungs they skipped were where the system would have learned what it couldn't handle.
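To make the ladder concrete, here's a minimal sketch of how the rungs could map to routing logic. Everything in it is illustrative (the rung names, the threshold, the queue labels); it shows the shape of the idea, not Rills' actual implementation.

```python
from enum import Enum

class Rung(Enum):
    MANUAL = 1       # you do everything yourself
    ASSISTED = 2     # AI drafts, you execute
    SUPERVISED = 3   # AI executes confident cases, queues the rest for you
    AUTONOMOUS = 4   # AI executes without review, earned per task

def dispatch(rung: Rung, confidence: float, threshold: float = 0.9) -> str:
    """Route one execution of one workflow step according to its rung."""
    if rung is Rung.MANUAL:
        return "human_queue"                 # no AI involvement at all
    if rung is Rung.ASSISTED:
        return "human_queue_with_ai_draft"   # AI suggests, you send
    if rung is Rung.SUPERVISED:
        # The step runs on its own only when this specific execution
        # scores high; everything else pauses for your review.
        return "auto_execute" if confidence >= threshold else "review_queue"
    return "auto_execute"                    # AUTONOMOUS: trust already earned
```

The detail that matters is that the rung belongs to the task, not the system: one step can sit at rung 4 while the step next to it is still at rung 2.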
How to know when to advance
The natural question is what makes something ready to move up a rung. "It seems to be working" isn't an answer you can act on when you're deciding whether to remove human review from a step that sends emails to clients.
Confidence scoring answers this concretely. Every time a workflow step runs, it scores that specific execution: how clear is the input, how confident is the classification, how closely does this case resemble ones the system has handled correctly before? High-confidence executions accumulate a track record. Low-confidence ones surface for review.
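As a sketch, a per-execution score could be as simple as a weighted blend of those three signals. The signal names and weights below are made up for illustration; the point is that every run gets its own number, rather than the model getting one global grade.

```python
def score_execution(input_clarity, classification_confidence, similarity_to_history,
                    weights=(0.25, 0.45, 0.30)):
    """Combine per-execution signals into a single 0-1 confidence score.

    All three inputs are assumed to be normalized to [0, 1]:
      - input_clarity: how unambiguous the incoming data was
      - classification_confidence: the model's own certainty
      - similarity_to_history: resemblance to cases handled correctly before
    """
    w_clarity, w_class, w_history = weights
    return (w_clarity * input_clarity
            + w_class * classification_confidence
            + w_history * similarity_to_history)

# A clear email about a known product scores high and runs automatically;
# a vague one scores low and surfaces for review.
score_execution(0.9, 0.95, 0.85)  # ~0.91 -> auto_execute
score_execution(0.4, 0.55, 0.30)  # ~0.44 -> review_queue
```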
After two or three weeks of running a workflow at supervised autonomy, you can see clearly: the AI classifies inbound leads correctly 97% of the time when the email contains a company name and a specific product question, and misclassifies about a third of the time when the email is vague or ambiguous. You can let the confident cases run automatically and keep the ambiguous ones in your manual queue. You're not guessing anymore; you're looking at actual performance data from your actual inputs.
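What those two or three weeks buy you is a track record you can aggregate by pattern. A rough sketch of that aggregation, with hypothetical pattern labels:

```python
from collections import defaultdict

def accuracy_by_pattern(history):
    """Roll a few weeks of reviewed runs into per-pattern accuracy.

    `history` is a list of (pattern, approved) pairs. The pattern
    labels are illustrative; in practice they come from however
    executions get bucketed.
    """
    totals = defaultdict(lambda: [0, 0])   # pattern -> [approved, total]
    for pattern, approved in history:
        totals[pattern][1] += 1
        totals[pattern][0] += int(approved)
    return {p: ok / n for p, (ok, n) in totals.items()}

history = [
    ("company_name+product_question", True),
    ("company_name+product_question", True),
    ("vague_inquiry", False),
    ("vague_inquiry", True),
]
rates = accuracy_by_pattern(history)  # {'company_name+product_question': 1.0, 'vague_inquiry': 0.5}
ready = {p for p, rate in rates.items() if rate >= 0.95}  # promote these; keep the rest in review
```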
Stitch Fix built a permanent version of this for outfit recommendations. Their engineering team runs daily human review of algorithmically generated outfits against a quality rubric, not because they don't trust the algorithm, but because incorporating that feedback loop produced a 14% improvement in their internal quality measure and a measurable revenue lift. The human layer isn't a temporary scaffold they're planning to remove. It's part of what makes the system work.
You may not need permanent human review for every workflow you build. But the principle holds: supervised operation is where you learn what the system actually does, not what the demo suggested it would do.
The queue that teaches itself
One concern people have about supervised automation is that the review queue never gets smaller: that you're trading manual work for slightly different manual work. In practice, it goes the other way.
When you approve or reject a step, that feedback teaches the system for future runs. Cases that match patterns you've consistently approved start clearing automatically. Cases that resemble ones you've corrected stay in the queue longer. After a few weeks, you're reviewing the genuinely hard calls, the ones that actually deserve human judgment, not re-litigating the same clear-cut cases you've already established patterns for.
A workflow that routes 40 items to your inbox in its first week might route 8 a few weeks later, not because it got smarter in some abstract sense but because it developed a track record on your specific decisions. The difference between AI agents and structured workflows matters here too: because the execution path is defined and each step is discrete, the system knows exactly which step produced which outcome and can apply that learning precisely where it's relevant.
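Here's a minimal sketch of that feedback loop, assuming feedback is tracked per step and per input pattern. The class name, thresholds, and labels are hypothetical; what it illustrates is why discrete steps let the learning land exactly where it applies.

```python
class ReviewQueueLearner:
    """Track approve/reject feedback per (step, pattern) and use it
    to decide which cases still need human review."""

    def __init__(self, promote_after=10, min_rate=0.95):
        self.stats = {}   # (step, pattern) -> (approved, total)
        self.promote_after = promote_after
        self.min_rate = min_rate

    def record(self, step, pattern, approved):
        ok, total = self.stats.get((step, pattern), (0, 0))
        self.stats[(step, pattern)] = (ok + int(approved), total + 1)

    def needs_review(self, step, pattern):
        ok, total = self.stats.get((step, pattern), (0, 0))
        # A pattern stays in the queue until it has both enough
        # history and a high enough approval rate to clear on its own.
        return total < self.promote_after or ok / total < self.min_rate

learner = ReviewQueueLearner()
for _ in range(12):
    learner.record("classify_lead", "company_name+product_question", approved=True)
learner.needs_review("classify_lead", "company_name+product_question")  # False: clears automatically
learner.needs_review("classify_lead", "vague_inquiry")                  # True: no track record yet
```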
Where to start
If you're currently doing everything manually because you don't trust AI automation, or you tried something fully autonomous and it didn't hold up, the supervised rung is the right entry point.
Pick one workflow. Run it at supervised autonomy for two weeks. Review every action it proposes. Pay attention to which ones are consistently right and which ones surprise you. At the end of week two, you'll have a concrete picture of what's ready to advance and what needs more time. You'll also have something Klarna didn't have before it made its announcement: evidence.
If you're not sure which workflow to start with, client follow-up automation is a good first case. The inputs are predictable, the output is a single email draft, and the approval step is natural. Most people see their review queue shrink noticeably within three weeks. That track record is what earns the next rung.
Approvals are always free on Rills. You only pay when the AI takes an action that creates real value (a sent email, a CRM update, an API call). Every review step, every confidence check, every approval in your queue costs nothing. Build your first supervised workflow and start collecting the track record that earns autonomy.
Ready to automate your workflows?
Eliminate monitoring anxiety with AI agents that propose actions while you stay in control. Start your 14-day trial today.