The Openfunds mapping challenge
Openfunds defines 350+ standardised field IDs for fund data. Beautiful in theory. Then you receive a CSV where the column header says "Fund Ccy" and you need to figure out if that's OFST010020 (Fund Currency) or OFST010070 (NAV Currency). Good luck.
The Openfunds standard is genuinely valuable. It gives the industry a common language for fund attributes. The problem isn't the standard — it's the gap between the standard and reality. No provider sends you data in Openfunds format. They send you their format, and you map it.
At my previous company, we had mapping spreadsheets for every provider. Hundreds of them. Each maintained by the person who originally onboarded that provider, often years ago. When that person left, their mapping knowledge left with them.
Why mapping is harder than it looks
Some mappings are obvious. A column called "ISIN" maps to OFST020010 (Fund Share Class ISIN). Even an intern gets that one.
Then it gets interesting. Consider these real column headers from different providers:
NAV— Is this NAV per share (OFST040010) or total NAV / AUM (OFST040020)?Currency— Fund base currency? Share class currency? NAV calculation currency? Openfunds has separate fields for each.Fee— Management fee? TER? Ongoing charges? Entry fee? Each is a different Openfunds field.Category— Provider's proprietary category? Morningstar category? SFDR classification? Asset class?Launch Date— Fund inception date (OFST060060)? Share class launch date? First NAV date?
Context matters enormously. "NAV" in a file from a fund administrator probably means NAV per share. "NAV" in a file from a data aggregator might mean AUM. You need to look at the actual values, the other columns present, and sometimes the file naming convention to disambiguate.
Where AI helps (and where it doesn't)
This is a problem AI is genuinely good at. Give a language model the column header, a sample of values, the other columns in the file, and the Openfunds field catalogue, and it can make an intelligent suggestion. "Based on the header 'Fund Ccy', the values being ISO currency codes, and the presence of a separate 'NAV Ccy' column, this maps to OFST010020 (Fund Currency) with 0.94 confidence."
That 0.94 matters. It's not 1.0. The AI is saying "I'm fairly sure, but check me." And that's exactly right.
Where AI falls down: inventing Openfunds field IDs that don't exist. We've seen models confidently suggest OFST999999 — a field that's entirely hallucinated. This is why the AI mapper doesn't get to decide alone.
The three-layer defence
We built three layers of protection against bad mappings:
- Constrained output: The AI can only suggest field IDs that exist in the Openfunds catalogue. It picks from a fixed list, not free text. This eliminates hallucinated IDs entirely.
- Confidence thresholds: Above 0.90 = auto-accept (but log it). Between 0.70 and 0.90 = suggest to user, require confirmation. Below 0.70 = flag as "needs manual mapping." No guessing.
- Validation against data: After mapping, check the values against the field's expected format. If
OFST010020(Fund Currency) is mapped but the values are decimal numbers, something is wrong. Catch it before it reaches the store.
An AI that's right 95% of the time and wrong 5% of the time is not a 95% solution. It's a system that introduces errors 1 in 20 times unless you build guardrails.
The mapping UX
When a new provider file arrives for the first time, the mapper presents each column with its AI suggestion, confidence score, and sample values. The user reviews, approves, or corrects each mapping. This takes 5-10 minutes for a typical 30-column file.
Here's the key: once confirmed, the mapping locks. The next time a file from this provider arrives with the same schema, the locked mapping applies automatically. No AI involved. No chance of variation. Deterministic execution from the second file onwards.
The mapping only unlocks when the schema changes — new columns appear, columns disappear, or column order shifts. Then the mapper re-engages for the new columns only, keeping existing mappings intact.
Custom fields
Not everything fits Openfunds. Providers send proprietary fields that have no standard equivalent. Provider risk scores. Internal ratings. Custom categories.
We handle these as custom fields with a CUSTOM_ prefix. They flow through the pipeline like any other field, store in the canonical model, and deliver to outbound pipes. The AI mapper learns to recognise these too — if a column consistently gets mapped to a custom field, the confidence score for that mapping increases over time.
The Openfunds standard solved the naming problem. The mapping problem remains. AI makes it 10x faster, but only when you constrain it, score it, and verify it. Trust but verify, then lock it down.