Confidence scores are a trust contract
Your AI mapper says it's 94% confident that the source column "Fund_Nm" maps to the Openfunds field OFST010050 (Fund Legal Name). What does 94% actually mean? If you can't answer that precisely, the number is decoration.
Confidence scores are everywhere in AI-assisted products. They feel reassuring. A high number in green, a low number in amber. But most confidence scores are model logits dressed up as probabilities, and they tell you almost nothing about real-world accuracy.
The calibration problem
A well-calibrated confidence score means: of all the times the system says "94%," it should be correct roughly 94% of the time. Not 100%. Not 80%. Ninety-four percent.
Most ML models are not calibrated this way out of the box. A language model's raw output probabilities tend to be overconfident — it'll say 95% when it's really right about 85% of the time. Or underconfident for easy cases — saying 80% for something it gets right 99% of the time.
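You can see the mismatch yourself with a very small amount of code. Here's a minimal sketch, assuming you have a labelled set of predictions as (confidence, was_correct) pairs; the ten-bin split and the variable names are illustrative choices, not a prescription:

```python
from collections import defaultdict

def reliability_report(predictions, num_bins=10):
    """Group predictions by stated confidence and compare to observed accuracy.

    `predictions` is a list of (confidence, was_correct) pairs, e.g. (0.94, True).
    """
    bins = defaultdict(list)
    for confidence, was_correct in predictions:
        # Bucket 0.90-0.99 into the top bin, 0.80-0.89 into the next, etc.
        bin_index = min(int(confidence * num_bins), num_bins - 1)
        bins[bin_index].append((confidence, was_correct))

    for bin_index in sorted(bins):
        items = bins[bin_index]
        avg_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, correct in items if correct) / len(items)
        gap = avg_confidence - accuracy   # positive = overconfident
        print(f"stated {avg_confidence:.0%}  actual {accuracy:.0%}  "
              f"gap {gap:+.0%}  n={len(items)}")
```

If the gaps are consistently positive, the model is overconfident; consistently negative, underconfident.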
We discovered this the hard way. Early in development, our field mapper would assign 90%+ confidence to nearly everything. It was right most of the time, but when it was wrong at 92% confidence, users lost trust in the entire system. If 92% doesn't mean "almost certainly correct," then no number means anything.
How we calibrate
We maintain a labelled evaluation set: 2,000+ field mappings across 40 different source file formats, with verified correct answers. Every time we update the mapping model or prompt, we run the full eval set and plot a calibration curve.
The curve should be a diagonal line: 70% confidence = 70% accuracy, 90% confidence = 90% accuracy, and so on. When the curve bows above the diagonal, the model is underconfident. Below, overconfident. We apply temperature scaling to bring it back to the diagonal.
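Temperature scaling itself is a one-parameter fix: divide the logit of each score by a temperature fitted on the eval set, which softens (or sharpens) every score without changing how mappings rank against each other. Here's a minimal sketch, using a grid search rather than a proper optimiser; the ranges and function names are illustrative:

```python
import math

def _logit(p):
    p = min(max(p, 1e-6), 1 - 1e-6)   # clamp away from exactly 0 or 1
    return math.log(p / (1 - p))

def rescale(p, temperature):
    # T > 1 softens an overconfident score, T < 1 sharpens an underconfident one.
    return 1 / (1 + math.exp(-_logit(p) / temperature))

def fit_temperature(predictions):
    """Pick the temperature that minimises negative log-likelihood on the
    eval set. `predictions` is a list of (confidence, was_correct) pairs."""
    def nll(temperature):
        total = 0.0
        for confidence, was_correct in predictions:
            p = rescale(confidence, temperature)
            total -= math.log(p if was_correct else 1 - p)
        return total

    candidates = [t / 100 for t in range(50, 301, 5)]   # search 0.50 .. 3.00
    return min(candidates, key=nll)
```

Once fitted, every raw score would pass through rescale(score, temperature) before anyone sees it.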
This sounds academic. It isn't. Calibration is what makes confidence scores actionable. If your operations team knows that 90% means "correct 9 out of 10 times," they can make rational decisions about what to review and what to auto-approve. If 90% means "somewhere between 75% and 98% accurate depending on the day," they'll review everything. Which defeats the purpose.
The display problem
Even with perfect calibration, showing users a raw percentage is often wrong. People are terrible at interpreting probabilities. "94%" feels almost identical to "97%," but across a batch of 500 mappings that three-point gap means roughly 15 additional errors.
We went through several iterations:
- Raw percentages — precise but overwhelming. Users fixated on the difference between 91% and 93%, which is noise.
- Three-tier buckets — High / Medium / Low. Too coarse. Users wanted more granularity for borderline cases.
- Colour-coded percentages with tier labels — what we settled on. Green with "Auto-approved" above 90%. Amber with "Review suggested" between 75% and 90%. Red with "Manual mapping required" below 75%.
The key insight was labelling the action, not just the number. Users don't care whether the score is 87% or 89%. They care whether they need to look at it.
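In code, the tiering is barely more than a lookup. This is a sketch under the thresholds described above; the boundary behaviour at exactly 90% and the return shape are choices for your own UI, not something fixed:

```python
def display_tier(score):
    """Map a calibrated confidence score to the colour and action label shown in the UI."""
    if score >= 0.90:
        return {"colour": "green", "label": "Auto-approved", "score": score}
    if score >= 0.75:
        return {"colour": "amber", "label": "Review suggested", "score": score}
    return {"colour": "red", "label": "Manual mapping required", "score": score}
```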
Composite confidence
A single mapping decision actually involves multiple signals, and collapsing them into one number loses information. We compute sub-scores:
- Semantic similarity — does the source column name resemble the target field name?
- Data pattern match — does the actual data in the column look like what the target field expects? (ISINs have a recognisable pattern. Currency codes are 3-letter strings from a known set.)
- Historical precedent — have we seen this exact source column mapped before, from this same data provider?
The overall score is a weighted combination of these. But when a user clicks into a mapping, they can see the breakdown. A 91% score where semantic similarity is 95% and data pattern is 98% but historical precedent is 40% tells a story: this looks right, but we've never seen this provider use this column name before. That's useful context a single number can't convey.
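To make the blend concrete, here's a rough sketch. The weights, the ISIN regex, and the helper names are invented for illustration (these particular weights just happen to reproduce the 91% example above); the real sub-score models are more involved than a regex and a dictionary lookup.

```python
import re

# Illustrative weights only -- the real blend is tuned against the eval set.
WEIGHTS = {"semantic": 0.4, "data_pattern": 0.5, "precedent": 0.1}

ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")   # e.g. LU0123456789

def pattern_match_score(values, pattern=ISIN_PATTERN):
    """Fraction of sampled column values that look like the target field expects.
    Currency codes would use a membership check against the known 3-letter set instead."""
    if not values:
        return 0.0
    return sum(1 for v in values if pattern.match(v)) / len(values)

def composite_confidence(sub_scores):
    """Blend sub-scores into the headline number, but keep the breakdown
    so the UI can show it when a user clicks into the mapping."""
    overall = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return {"overall": overall, "breakdown": sub_scores}

# The example from the text: looks right, but we've never seen this
# provider use this column name before.
result = composite_confidence({"semantic": 0.95, "data_pattern": 0.98, "precedent": 0.40})
print(f"{result['overall']:.0%}", result["breakdown"])   # ~91% plus the full breakdown
```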
A confidence score without context is just a number. A confidence score with an explanation is a decision-support tool.
The trust contract
Here's what I mean by "trust contract": when you show a confidence score, you're making an implicit promise to the user. You're saying "this number is meaningful, and you can use it to decide how much attention to give this item."
Break that promise — show 95% on something that's clearly wrong — and you've lost the user. Not for that one item. For every item. They'll start manually checking everything regardless of the score, and your AI-assisted workflow becomes a manual workflow with extra UI.
Honour the contract — show calibrated scores that match real-world accuracy, with clear action labels and explainable sub-scores — and users develop genuine trust. Not blind trust. Informed trust. The kind where they auto-approve the 92% items without anxiety and focus their attention on the 78% items where the system genuinely needs help.
That's the difference between AI that augments human work and AI that creates more of it.