What we learned running fund data infrastructure at scale
I spent seven years building and operating fund data infrastructure at an enterprise platform that served some of the largest asset managers in Europe. Here's what I wish someone had told me on day one.
Most of these lessons aren't technical. The hardest problems in fund data aren't engineering problems. They're coordination problems wearing a technical disguise.
The data is never as clean as they say it is
Every new provider onboarding starts the same way. "We have clean, structured data. Just connect to our SFTP and pull the daily file." Then you open the file.
The ISIN column contains a mix of ISINs and SEDOLs. The NAV field uses commas as decimal separators in some rows and periods in others. There's a share class with a launch date of 1900-01-01 because someone needed a placeholder and never came back to fix it.
This isn't an exception. This is the baseline. Every single provider, without fail, has data quality issues they don't know about. The question isn't whether you'll find problems. It's whether you have the tooling to surface them before they reach your clients.
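To make that concrete, here's a minimal sketch of the kind of ingestion checks that surface these problems instead of silently coercing them. The column names (`isin`, `nav`, `launch_date`) and the placeholder-date list are assumptions for illustration, not taken from any particular provider's feed.

```python
import re
import pandas as pd

# Illustrative column names - real feeds rarely agree on naming.
ISIN_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")        # 12-char ISIN
SEDOL_PATTERN = re.compile(r"^[B-DF-HJ-NP-TV-Z0-9]{6}[0-9]$")    # 7-char SEDOL, no vowels
PLACEHOLDER_DATES = {"1900-01-01", "1970-01-01"}                 # assumed placeholder values

def surface_quality_issues(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per suspect value so issues are visible before delivery."""
    issues = []
    for idx, row in df.iterrows():
        identifier = str(row["isin"]).strip()
        if not ISIN_PATTERN.match(identifier):
            kind = "sedol_in_isin_column" if SEDOL_PATTERN.match(identifier) else "invalid_identifier"
            issues.append({"row": idx, "field": "isin", "issue": kind, "value": identifier})

        nav = str(row["nav"])
        if "," in nav and "." not in nav:
            # Almost certainly a locale problem, not a thousands separator.
            issues.append({"row": idx, "field": "nav", "issue": "comma_decimal_separator", "value": nav})

        if str(row["launch_date"]) in PLACEHOLDER_DATES:
            issues.append({"row": idx, "field": "launch_date", "issue": "placeholder_date", "value": row["launch_date"]})
    return pd.DataFrame(issues)
```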
Automation fails at the edges
We automated about 85% of data processing at the enterprise platform. File ingestion, parsing, validation, normalisation, delivery - all automated. The remaining 15% consumed 80% of the operations team's time.
That 15% is where fund data gets weird:
- A fund merges with another fund, and the surviving ISIN inherits data from both. Which NAV history do you keep?
- A provider starts sending a new column with no documentation. Is it a TER or an OCF? They look similar but mean different things in different jurisdictions.
- A share class changes its base currency. Every historical data point needs to be understood in context - was it reported in the old currency or the new one?
- Two providers report different domiciles for the same fund. One says Luxembourg, the other says Ireland. Both are technically correct - the fund is domiciled in Luxembourg but the share class is registered in Ireland.
These edge cases require domain expertise. You can build decision trees for some of them, but the long tail is infinite. The right answer is a system that handles the common cases automatically and surfaces the weird ones to humans with enough context to decide quickly.
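One way to make that hand-off workable is to give reviewers a structured exception rather than a raw alert: the field, the conflicting values, their sources, and what was last published. The sketch below is illustrative; the record shape and field names are assumptions, not a description of any production system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FieldException:
    """An edge case the pipeline couldn't resolve on its own, packaged with
    enough context that a human can decide in minutes, not half an hour."""
    share_class_id: str
    field_name: str
    candidate_values: dict[str, str]          # source -> value it reported
    historical_value: str | None = None       # what was last published, if anything
    notes: str = ""
    raised_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def summary(self) -> str:
        sources = ", ".join(f"{src}={val}" for src, val in self.candidate_values.items())
        return (f"{self.share_class_id}.{self.field_name}: {sources} "
                f"(last published: {self.historical_value or 'n/a'})")

# Example: the domicile disagreement from the list above (identifiers invented).
exc = FieldException(
    share_class_id="LU0000000000",
    field_name="domicile",
    candidate_values={"fund_administrator": "Luxembourg", "data_vendor": "Ireland"},
    historical_value="Luxembourg",
    notes="Vendor may be reporting share-class registration, not fund domicile.",
)
print(exc.summary())
```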
The real bottleneck is onboarding
At the enterprise platform, onboarding a new data provider took an average of 6 weeks. Not because the integration was complex - most of it was SFTP + CSV parsing. The time went to:
- Getting sample files (2 weeks of emails)
- Understanding the schema (no documentation, reverse-engineer from data)
- Building the field mapping (manual, column by column)
- Validation rule configuration (which checks apply to this provider's data?)
- UAT with the client (3 rounds of "that field is wrong")
The actual engineering work was maybe 2 days. The rest was waiting for people to respond to emails and review spreadsheets.
This is what we're attacking with Kairo. AI-assisted mapping cuts the schema-understanding and field-mapping work from days to minutes. But the fundamental bottleneck - human coordination - needs a different solution: self-service tooling that lets clients see and approve mappings themselves.
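A reviewable mapping can be as simple as a declarative config: expressed as data, it can be rendered to the client for sign-off instead of living inside parser code. A hedged sketch follows; the provider name, column names, and transform/validation identifiers are all invented for illustration.

```python
# Hypothetical mapping for one provider, kept as data so it can be shown to
# the client and approved before anything flows to production.
PROVIDER_MAPPING = {
    "provider": "example_fund_admin",
    "file_pattern": "daily_nav_*.csv",
    "fields": {
        # source column  -> target schema field (+ optional transform)
        "Fund ISIN":      {"target": "isin"},
        "NAV per Share":  {"target": "nav", "transform": "decimal_comma_to_point"},
        "NAV Ccy":        {"target": "nav_currency", "transform": "uppercase"},
        "Ongoing Charge": {"target": "ocf"},   # confirmed with the provider: OCF, not TER
    },
    "validations": ["isin_checksum", "nav_positive", "currency_iso_4217"],
    "approved_by": None,   # filled in when the client signs off
}

def unmapped_columns(file_columns: list[str]) -> list[str]:
    """Columns in the incoming file that nobody has mapped yet - flag them, don't guess."""
    mapped = set(PROVIDER_MAPPING["fields"])
    return [c for c in file_columns if c not in mapped]
```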
Scale changes everything about conflict resolution
When you have 5 data providers, conflicts are manageable. You call someone, figure out which value is correct, update the record. When you have 50 providers covering 100,000 share classes, you need a system.
We built priority hierarchies. The fund administrator's data beats the asset manager's data. The asset manager's data beats the third-party vendor's. Regulatory filings beat everything. These rules resolved 90% of conflicts automatically.
The remaining 10% still needed human review. But the system could tell you exactly which fields conflicted, which sources disagreed, and what the historical pattern was. A conflict that used to take 30 minutes to investigate took 2 minutes with the right tooling.
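Once the source ranking is explicit, the automatic part of that resolution is small. A minimal sketch with invented source names, ordered to mirror the rules above; the important design choice is that losing values are returned rather than discarded, so residual conflicts can still be queued for review.

```python
# Lower rank wins. Ordering mirrors the hierarchy described above.
SOURCE_PRIORITY = {
    "regulatory_filing": 0,
    "fund_administrator": 1,
    "asset_manager": 2,
    "data_vendor": 3,
}

def resolve_field(candidates: dict[str, str]) -> tuple[str | None, dict[str, str]]:
    """Pick the value from the highest-priority source and keep the rest visible."""
    if not candidates:
        return None, {}
    ranked = sorted(candidates.items(), key=lambda kv: SOURCE_PRIORITY.get(kv[0], 99))
    winner_source, winner_value = ranked[0]
    disagreements = {src: val for src, val in ranked[1:] if val != winner_value}
    return winner_value, disagreements

# Example: the domicile conflict again.
value, conflicts = resolve_field({"data_vendor": "Ireland", "fund_administrator": "Luxembourg"})
# value == "Luxembourg"; conflicts == {"data_vendor": "Ireland"} -> route to review if non-empty
```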
What I'd do differently
If I could go back and redesign the enterprise platform from scratch:
- Schema-first, not file-first. We spent too long building per-file parsers instead of defining the target schema and mapping everything to it.
- Immutable data pipelines. Every transformation should be replayable. We lost too many hours debugging "what happened to this NAV value" because intermediate states weren't preserved.
- Confidence scores from day one. Not every data point is equally trustworthy. A NAV from a fund admin at T+1 is better than a NAV from a data vendor at T+0. The system should know this (see the sketch after this list).
- Smaller teams, better tooling. We had 15 people in data operations. Most of their work was compensating for inadequate tooling. Five people with the right tools would have been faster.
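For the confidence-scores point, here is a toy sketch of what "the system should know this" could mean in practice. The source weights and the estimate discount are made-up numbers; the only point is that source and timeliness are modelled explicitly rather than assumed.

```python
# Illustrative trust weights per source - not calibrated figures.
SOURCE_WEIGHT = {
    "fund_administrator": 0.9,
    "asset_manager": 0.7,
    "data_vendor": 0.5,
}

def confidence(source: str, is_final: bool) -> float:
    """Score a data point by who reported it and whether it's a final figure.

    A fund admin's T+1 NAV is the confirmed figure; a vendor's T+0 NAV is
    usually an estimate, so it gets discounted.
    """
    base = SOURCE_WEIGHT.get(source, 0.3)   # unknown sources start low
    return base * (1.0 if is_final else 0.6)

# The admin's confirmed T+1 NAV outranks the vendor's T+0 estimate.
assert confidence("fund_administrator", is_final=True) > confidence("data_vendor", is_final=False)
```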
That last point - smaller teams, better tooling - is basically the thesis behind Kairo. The fund data industry doesn't have a people problem. It has a tooling problem.