Catching discrepancies across data sources
Provider A says the NAV for LU0292097234 is 142.37. Provider B says 142.73. One of them transposed two digits. You won't catch this with single-source validation.
Most fund data validation is single-source: check the file format, verify required fields are present, confirm the NAV is within a tolerance band of yesterday's value. That catches obvious errors — a NAV of zero, a missing ISIN, a date in the wrong format.
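That layer is worth having, and it's small. A minimal sketch in Python, with hypothetical field names and a hypothetical 5% day-over-day band (no specific vendor schema implied):

```python
from datetime import date

def validate_single_source(record: dict, prior_nav: float, band: float = 0.05) -> list[str]:
    """Single-source checks: required fields, plausibility, day-over-day tolerance."""
    errors = []
    for field in ("isin", "nav", "as_of"):  # hypothetical required fields
        if record.get(field) in (None, ""):
            errors.append(f"missing field: {field}")
    nav = record.get("nav")
    if isinstance(nav, (int, float)):
        if nav <= 0:
            errors.append("NAV must be positive")
        elif prior_nav and abs(nav - prior_nav) / prior_nav > band:
            errors.append(f"NAV moved more than {band:.0%} day-over-day")
    if not isinstance(record.get("as_of"), date):
        errors.append("as_of is not a valid date")
    return errors
```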
Single-source validation doesn't catch the subtle errors. The ones where the data looks plausible from one source but contradicts another. These are the errors that survive, propagate to downstream consumers, and surface weeks later as a client complaint or a regulatory query.
The prerequisite nobody talks about
Cross-source comparison sounds simple. Take the same data point from two providers and compare. But here's the problem nobody addresses first: how do you know it's the same fund?
Provider A identifies funds by ISIN. Provider B uses their internal ID. Provider C uses a Bloomberg ticker. Before you can compare their NAVs, you need to resolve all three identifiers to the same entity. If your identifier resolution is wrong, your cross-source comparison is worse than useless — it's confidently comparing two different funds.
This is why we built the identifier graph before the comparison engine. You can't do step two without step one. Every platform that tries to bolt on cross-source checking without proper entity resolution ends up with a spreadsheet of "manual overrides" that grows until someone quits.
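To make the dependency concrete, here's a toy version of the resolution step. The real identifier graph is a proper data structure with provenance and history; a dict stands in for it here, and every identifier except the ISIN from the example above is made up:

```python
# Hypothetical identifier graph: every known alias maps to one internal entity ID.
IDENTIFIER_GRAPH = {
    ("isin", "LU0292097234"): "fund-001",
    ("provider_b_id", "987654"): "fund-001",   # made-up internal ID
    ("bbg_ticker", "XYZF LX"): "fund-001",     # made-up Bloomberg ticker
}

def resolve(id_type: str, value: str) -> str:
    """Resolve any provider identifier to the internal entity ID, or fail loudly."""
    try:
        return IDENTIFIER_GRAPH[(id_type, value)]
    except KeyError:
        # Unresolved identifiers become exceptions, never silent skips:
        # comparing two different funds is worse than comparing nothing.
        raise LookupError(f"unmapped identifier: {id_type}={value}")
```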
What to compare and how
Not all data points are worth comparing across sources. Static reference data — fund name, domicile, inception date, currency — should be identical. A mismatch here usually means one source is stale or wrong. Flag it, investigate, update.
Time series data — NAVs, AUM, performance — requires tolerance. Providers calculate slightly differently. Rounding varies. Pricing timestamps differ. A 0.01% NAV difference between two providers might be rounding. A 0.5% difference is a problem.
We use tiered thresholds (sketched in code after the list):
- Exact match fields: ISIN, currency, domicile, legal structure. Must agree. Any mismatch is an exception.
- Tolerance match fields: NAV (0.02% tolerance), AUM (0.5% tolerance), performance figures (0.05% tolerance). Within tolerance = fine. Outside = exception.
- Soft match fields: Fund name, management company name. Fuzzy string matching because providers spell things differently. "BlackRock" vs "Blackrock" vs "BLACKROCK" shouldn't trigger an alert.
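In code, the tiers look something like this. The thresholds mirror the list above; the fuzzy-match cutoff of 0.9 is an illustrative choice, not a universal constant:

```python
from difflib import SequenceMatcher

EXACT = {"isin", "currency", "domicile", "legal_structure"}
TOLERANCE = {"nav": 0.0002, "aum": 0.005, "performance": 0.0005}  # 0.02%, 0.5%, 0.05%
SOFT = {"fund_name", "management_company"}

def compare(field: str, a, b) -> bool:
    """Return True if two values agree under the field's tier."""
    if field in EXACT:
        return a == b                       # any mismatch is an exception
    if field in TOLERANCE:
        denom = max(abs(a), abs(b)) or 1.0  # avoid dividing by zero
        return abs(a - b) / denom <= TOLERANCE[field]
    if field in SOFT:
        # Case-insensitive fuzzy match so "BlackRock" == "BLACKROCK".
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.9
    raise ValueError(f"no comparison tier for field: {field}")
```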
A real example
Here's a scenario from production. Three providers send daily NAV data for a Luxembourg UCITS fund:
- Provider A (fund administrator): NAV = 142.3700, as-of date = 2026-03-25
- Provider B (data vendor): NAV = 142.3700, as-of date = 2026-03-25
- Provider C (transfer agent): NAV = 141.8200, as-of date = 2026-03-25
Providers A and B agree. Provider C is off by 0.39%. That's well outside tolerance. The comparison engine flags it immediately.
Investigation reveals Provider C is reporting the previous day's NAV with today's date — a stale data issue. Without cross-source comparison, this would have been published as current. With it, an operations analyst resolves the exception in 2 minutes.
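The arithmetic behind that flag is simple enough to show. Running the three NAVs through pairwise tolerance checks (provider labels are shorthand; the 0.02% tolerance is the NAV tier from above):

```python
from itertools import combinations

navs = {"admin": 142.3700, "vendor": 142.3700, "transfer_agent": 141.8200}

for (src_a, nav_a), (src_b, nav_b) in combinations(navs.items(), 2):
    diff = abs(nav_a - nav_b) / max(nav_a, nav_b)
    print(f"{src_a} vs {src_b}: {diff:.4%}", "ok" if diff <= 0.0002 else "EXCEPTION")
# admin vs vendor: 0.0000% ok
# admin vs transfer_agent: 0.3863% EXCEPTION   <- the stale NAV
# vendor vs transfer_agent: 0.3863% EXCEPTION
```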
The most dangerous data errors are the ones that look right from any single source. You only see them when sources talk to each other.
Source hierarchy
When sources disagree, someone has to decide who wins. We use a configurable source hierarchy: the fund administrator is typically the golden source for NAV, the management company for static data, and regulatory filings for legal structure.
But hierarchy isn't always the answer. Sometimes the "golden source" is wrong. A fund admin sends a file with a fat-finger error. If two other sources disagree with the admin, maybe the admin needs checking. We surface these as high-priority exceptions rather than auto-accepting the hierarchy winner.
The rule: two sources agreeing against the golden source is a signal, not an override. It goes to human review with full context — here are the values, here are the sources, here's the historical pattern. The analyst decides.
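A sketch of that arbitration rule, assuming one golden source per field and the pairwise tolerance from earlier (source names and the hierarchy are illustrative):

```python
HIERARCHY = {"nav": ["fund_admin", "data_vendor", "transfer_agent"]}  # illustrative

def arbitrate(field: str, values: dict[str, float], tol: float = 0.0002):
    """Accept the golden source, unless the other sources agree against it."""
    golden = HIERARCHY[field][0]
    golden_val = values[golden]
    dissent = {s: v for s, v in values.items()
               if s != golden and abs(v - golden_val) / golden_val > tol}
    # Two or more sources agreeing with each other against the golden source
    # is a signal for human review, not an automatic override.
    if len(dissent) >= 2:
        vals = list(dissent.values())
        if all(abs(v - vals[0]) / vals[0] <= tol for v in vals):
            return ("REVIEW", values)      # analyst decides, with full context
    return ("ACCEPT", golden_val)
```

A fat-fingered admin NAV of 124.37 against two other sources reporting 142.37 lands in REVIEW rather than being published.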
Scaling comparison
With 3 providers and 1,000 share classes, every data point means three pairwise checks per share class: 3,000 comparisons per day. With 10 data points, that's 30,000 comparisons. This has to be fast and automated — you can't have humans scanning spreadsheets.
The comparison engine runs as part of the ingestion pipeline. As soon as a new value arrives, it checks against existing values from other sources for the same entity and date. Discrepancies surface in the exception queue within seconds of file processing, not hours later when someone remembers to check.
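The hook itself is small; the speed comes from checking at write time instead of batch time. A sketch with an in-memory store standing in for the real database and exception queue:

```python
from collections import defaultdict

TOLERANCE = {"nav": 0.0002, "aum": 0.005}   # illustrative, per the tier list
store = defaultdict(dict)                   # (entity_id, as_of, field) -> {source: value}
exception_queue = []                        # stand-in for the real exception workflow

def ingest(source: str, entity_id: str, as_of: str, field: str, value: float) -> None:
    """On arrival, compare the new value against every other source's value."""
    key = (entity_id, as_of, field)
    for other_source, other_value in store[key].items():
        diff = abs(value - other_value) / max(value, other_value)
        if diff > TOLERANCE.get(field, 0.0):
            exception_queue.append((key, source, other_source, value, other_value))
    store[key][source] = value   # store only after comparing, so a new value is never checked against itself
```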
Single-source validation catches formatting errors. Cross-source comparison catches data errors. If you're only doing the first, you're catching the easy problems and missing the ones that actually matter.