17 March 2026 · Kairos

Designing human-in-the-loop for exceptions, not approvals

Here's a pattern I've seen at every fund data platform I've worked with: someone builds automation, then wraps every output in a manual approval step "for safety." Congratulations — you've built a dashboard with extra steps.

Human-in-the-loop sounds responsible. It sounds prudent. In regulated industries like fund management, it sounds like the kind of thing compliance would love. But HITL done wrong is worse than no automation at all, because it creates the illusion of oversight while delivering rubber-stamp fatigue.

The rubber stamp problem

Put a human in front of 200 items to review. The first 20 get genuine scrutiny. Items 21 through 50 get a quick glance. By item 100, they're clicking "approve" on autopilot. This isn't laziness — it's cognitive reality. When 97% of items are correct, the human brain stops looking for the 3% that aren't.

I watched this happen in production. A data operations team was reviewing AI-suggested field mappings: source column names matched to Openfunds field IDs. The AI was right about 95% of the time. The team approved batches of 50 mappings in under two minutes. They were clicking faster than they could read.

The errors that slipped through? They were the subtle ones. OFST010050 (Fund Legal Name) mapped instead of OFST010060 (Fund Marketing Name). Both plausible. Both wrong in context. Exactly the kind of mistake a rubber-stamping human will miss.

Exception-only triggers

The fix isn't more reviewers or better training. It's fewer items to review. HITL should trigger on genuine exceptions: cases where the system doesn't know, or where the stakes of being wrong are high enough to warrant interrupting a human.

In our mapping pipeline, we route items to human review based on specific triggers:

- Confidence below the threshold for the field's category (more on calibrating those below)
- A top candidate and runner-up that are nearly tied, like the Legal Name / Marketing Name pair above
- Regulatory fields, where the stakes justify the strictest threshold regardless of how confident the model sounds

With these triggers, out of 200 mappings, maybe 12-15 hit human review. The reviewer knows every item on their screen is there for a reason. Their attention is focused. Error detection goes up dramatically.
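
As a sketch of what that routing can look like, here is a minimal Python version. The trigger names, the `MappingSuggestion` shape, and the margin value are illustrative assumptions, not our production pipeline:

```python
from dataclasses import dataclass

# Per-category thresholds, as discussed in the next section.
CATEGORY_THRESHOLDS = {
    "static_reference": 0.88,   # fund name, domicile, currency
    "time_series": 0.92,        # NAV, performance, flows
    "regulatory": 0.95,         # SFDR classification, PRIIPS risk rating
}

AMBIGUITY_MARGIN = 0.10  # assumed value: flag near-ties between candidates

@dataclass
class MappingSuggestion:
    source_column: str
    field_id: str                # e.g. "OFST010060"
    category: str                # key into CATEGORY_THRESHOLDS
    confidence: float            # top candidate's score
    runner_up_confidence: float  # second-best candidate's score

def review_trigger(s: MappingSuggestion) -> str | None:
    """Return the trigger that fired, or None to auto-approve."""
    if s.confidence < CATEGORY_THRESHOLDS[s.category]:
        return "below_threshold"
    if s.confidence - s.runner_up_confidence < AMBIGUITY_MARGIN:
        # Two plausible candidates: the Legal Name / Marketing Name trap.
        return "ambiguous_candidates"
    return None
```

The function returns the trigger rather than a bare boolean so the queue can tell the reviewer why each item is in front of them, which matters for the interface discussion below.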

The threshold problem

Setting the right confidence threshold is harder than it sounds. Too low (say, 70%) and you let too many questionable items through automatically. Too high (say, 95%) and you're back to reviewing half the queue.

The temptation is to start conservative and lower the threshold over time as you build trust. This is backwards. Start by understanding your error cost.

If a wrong mapping costs you 5 minutes to fix downstream, you can afford a lower threshold. If it costs you a regulatory filing error, the threshold should be much higher. The number isn't about the AI's capability. It's about your tolerance for being wrong.

We calibrate thresholds per field category. Static reference data (fund name, domicile, currency) has a threshold of 88%. Time-series data (NAV, performance, flows) has a threshold of 92%. Regulatory fields (SFDR classification, PRIIPS risk rating) sit at 95%. These aren't arbitrary — they're tuned against a labelled dataset where we know the cost of each error type.
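
One way to do that tuning, sketched under the assumption that the labelled dataset is a list of (confidence, was-it-correct) pairs: sweep candidate thresholds and pick the one that minimises total expected cost. The cost figures are hypothetical.

```python
def calibrate_threshold(
    labelled: list[tuple[float, bool]],  # (model confidence, mapping was correct)
    error_cost: float,                   # cost of a wrong mapping slipping through
    review_cost: float,                  # cost of routing one item to a human
) -> float:
    """Choose the auto-approval threshold with the lowest total cost."""
    best_threshold, best_cost = 1.0, float("inf")
    for t in sorted({conf for conf, _ in labelled}):
        cost = 0.0
        for conf, correct in labelled:
            if conf >= t:
                # Auto-approved: free if right, expensive if wrong.
                cost += 0.0 if correct else error_cost
            else:
                # Routed to a human; assume the reviewer catches the error.
                cost += review_cost
        if cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold
```

Raise `error_cost`, say from a five-minute downstream fix to a regulatory filing error, and the chosen threshold climbs on its own. That is the "tolerance for being wrong" point made concrete.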

The feedback loop

Exception-based HITL only works if human decisions feed back into the system. When a reviewer corrects a mapping, that correction should improve future suggestions. When they confirm a borderline suggestion, that's training signal too.

Without this loop, your exception rate stays static. With it, the system gets better and the exception queue gets shorter. We've seen exception rates drop from 8% to under 3% over six months of continuous feedback.
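
A minimal version of that loop, assuming a JSONL store and exact-match reuse of corrections (both assumptions; a real system would fold this into retraining):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

FEEDBACK_LOG = Path("mapping_feedback.jsonl")  # hypothetical store

def record_review(source_column: str, suggested_field_id: str,
                  decision: str, corrected_field_id: str | None = None) -> None:
    """Log every reviewer decision; confirmations are training signal too."""
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "source_column": source_column,
            "suggested_field_id": suggested_field_id,
            "decision": decision,                 # "approved" or "corrected"
            "corrected_field_id": corrected_field_id,
        }) + "\n")

def learned_overrides() -> dict[str, str]:
    """Past corrections become hard overrides for identical source columns."""
    overrides: dict[str, str] = {}
    if FEEDBACK_LOG.exists():
        for line in FEEDBACK_LOG.read_text().splitlines():
            rec = json.loads(line)
            if rec["decision"] == "corrected":
                overrides[rec["source_column"]] = rec["corrected_field_id"]
    return overrides
```

Even this crude exact-match reuse shrinks the queue: a column the team corrected once never comes back as an exception.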

Design for the reviewer

One more thing that gets overlooked: the review interface matters enormously. Don't just show the item and an approve/reject button. Show:

- Why the item was flagged: which trigger fired, and the confidence score behind it
- The runner-up candidates, so near-misses sit side by side instead of hiding
- The source context needed to judge the mapping, starting with the original column name
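
A queue item that carries all of this might look like the sketch below. The structure and field names are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReviewItem:
    # What the reviewer is deciding on
    source_column: str
    suggested_field_id: str            # e.g. "OFST010050"
    suggested_field_name: str          # e.g. "Fund Legal Name"
    # Why it is in the queue, shown up front
    trigger: str                       # "below_threshold", "ambiguous_candidates", ...
    confidence: float
    # Near-misses, displayed side by side with the suggestion
    runner_up: tuple[str, float] | None = None   # (field_id, confidence)
    # Context for judging the mapping without leaving the screen
    sample_values: list[str] = field(default_factory=list)
```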

A reviewer who understands why they're looking at something makes better decisions than one who's just presented with a binary choice. The goal is to make the exception queue feel like detective work, not data entry. One is engaging. The other leads to rubber stamps.
