Fund identifier resolution is a graph problem
A single fund share class can have an ISIN, a SEDOL, a Bloomberg ticker, a Reuters RIC, a Morningstar SecId, a Lipper ID, and an internal code that someone invented in 2014. Your lookup table just caught fire.
I spent years maintaining identifier mapping tables. Big spreadsheets, later database tables, with columns like isin, sedol, bloomberg_ticker, and a prayer that someone would keep them updated. They never stayed updated.
The fundamental issue: lookup tables assume a one-to-one relationship between identifiers. Reality is messier. An ISIN like LU0292097234 maps to a share class. That share class belongs to a fund. That fund has an LEI. The management company has a different LEI. The Bloomberg ticker DBXESG LN points to a listing, not the share class itself. These are different things connected by relationships.
Why lookup tables break
Consider a real scenario. You receive NAV data from Provider A keyed by ISIN. Provider B sends performance data keyed by SEDOL. Provider C uses their own internal ID. Your client wants output keyed by Bloomberg ticker.
In a lookup table world, you need every possible pair mapped. With n identifier schemes that's n(n-1)/2 pairwise mappings, so the work grows as O(n²). Miss one mapping and the data silently fails to join. Worse, identifiers change. SEDOLs get reassigned. ISINs occasionally have typos in source files. A flat lookup gives you no way to express confidence or provenance.
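To make the pair count and the silent failure concrete, here is a small sketch. The scheme list is illustrative, and the identifiers and performance values are placeholders rather than real mappings.

```python
from itertools import combinations

# Identifier schemes named above; treat the exact list as illustrative.
SCHEMES = ["isin", "sedol", "bloomberg_ticker", "ric",
           "morningstar_secid", "lipper_id", "internal_code"]

# One mapping table per unordered pair of schemes: n * (n - 1) / 2 of them.
pair_tables = list(combinations(SCHEMES, 2))
print(len(pair_tables))  # 21 tables for 7 schemes

# A flat SEDOL -> ISIN table with one mapping missing (placeholder values).
sedol_to_isin = {"SEDOL_A": "ISIN_A"}
performance_by_sedol = {"SEDOL_A": 0.042, "SEDOL_B": 0.017}

# A typical "join": rows without a mapping are filtered out and no error is raised.
joined = {
    sedol_to_isin[sedol]: perf
    for sedol, perf in performance_by_sedol.items()
    if sedol in sedol_to_isin
}

# Nothing surfaces the dropped row unless you go looking for it.
dropped = [sedol for sedol in performance_by_sedol if sedol not in sedol_to_isin]
print(joined)
print(dropped)
```

The join completes without complaint, and the only trace of the missing fund is a row count that is one short.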
The graph approach
Model identifiers as nodes. Relationships between them as edges. Each edge carries metadata: source (who told us these are linked), confidence (how sure are we), and timestamp (when was this asserted).
Now your identifier resolution becomes a graph traversal. Given the SEDOL B1YGK70, walk the graph to find the share class node, then walk outward to find the ISIN, the Bloomberg ticker, the LEI, or whatever else you need. Multiple paths to the same destination? That's corroboration, not a bug.
- Node types: Share class, fund, umbrella, management company, listing, index
- Edge types: HAS_ISIN, HAS_SEDOL, LISTED_ON, MANAGED_BY, SHARE_CLASS_OF
- Edge metadata: source provider, confidence score (0-1), first seen, last confirmed
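The model above can be sketched in plain Python, with no graph database required for the idea. This is a minimal illustration, not Kairo's implementation: the class names, the `resolve` method, the node id format, and the ISIN placeholder are all hypothetical, and identifier nodes are typed by scheme as an assumption.

```python
from collections import deque
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    kind: str            # e.g. HAS_ISIN, HAS_SEDOL, LISTED_ON
    source: str          # provider that asserted the link
    confidence: float    # 0-1
    first_seen: date
    last_confirmed: date


class IdentifierGraph:
    def __init__(self):
        self.node_types = {}  # node id -> node type
        self.adjacency = {}   # node id -> list of (neighbour id, Edge)

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type
        self.adjacency.setdefault(node_id, [])

    def add_edge(self, edge):
        # Links are walked in both directions, so store the edge on both ends.
        self.adjacency.setdefault(edge.src, []).append((edge.dst, edge))
        self.adjacency.setdefault(edge.dst, []).append((edge.src, edge))

    def resolve(self, start, target_type):
        """Breadth-first search from `start` to the nearest node of `target_type`.

        Returns (node id, list of edges walked), or None if nothing is reachable.
        """
        if start not in self.adjacency:
            return None
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            node, path = queue.popleft()
            if self.node_types.get(node) == target_type:
                return node, path
            for neighbour, edge in self.adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, path + [edge]))
        return None


# Usage: the ISIN and the dates are placeholders, not real reference data.
g = IdentifierGraph()
g.add_node("shareclass:1", "share_class")
g.add_node("sedol:B1YGK70", "sedol")
g.add_node("isin:PLACEHOLDER", "isin")
seen_on = date(2024, 1, 1)
g.add_edge(Edge("shareclass:1", "sedol:B1YGK70", "HAS_SEDOL", "provider_b", 0.90, seen_on, seen_on))
g.add_edge(Edge("shareclass:1", "isin:PLACEHOLDER", "HAS_ISIN", "provider_a", 0.95, seen_on, seen_on))

isin_node, path = g.resolve("sedol:B1YGK70", "isin")
print(isin_node, [edge.kind for edge in path])
```

Returning the edges walked, not just the destination, is what makes the provenance question answerable later: every hop carries its source and confidence.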
Resolution in practice
When a new file arrives with an unfamiliar identifier, the resolver does a few things in sequence:
- Exact match: Is this identifier already a node in the graph? If yes, we're done.
- Fuzzy match: Does the fund name plus domicile narrow it to a candidate cluster? Score it.
- External enrichment: Can we call the LEI registry or OpenFIGI to confirm?
- Human review: If confidence is below threshold, flag it. Don't guess.
Each step that succeeds adds edges to the graph, making future resolutions faster. The graph learns. A lookup table just sits there.
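The four steps can be sketched as a simple cascade. Several things here are assumptions for illustration: a dict stands in for the graph, string similarity from the standard library's `difflib` stands in for the fuzzy scorer, `enrich` is a hypothetical callable wrapping whatever external registry you use, and exact and external matches are given a confidence of 1.0.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Resolution:
    entity_id: Optional[str]
    confidence: float
    step: str  # "exact", "fuzzy", "external", or "review"


def resolve_identifier(
    identifier: str,
    fund_name: str,
    domicile: str,
    known: Dict[str, str],                   # identifier -> entity id (stand-in for the graph)
    candidates: Dict[str, Tuple[str, str]],  # entity id -> (fund name, domicile)
    enrich: Callable[[str], Optional[str]],  # hypothetical external lookup
    threshold: float = 0.85,
) -> Resolution:
    # 1. Exact match: the identifier is already known.
    if identifier in known:
        return Resolution(known[identifier], 1.0, "exact")

    # 2. Fuzzy match: compare fund names, restricted to the same domicile.
    best_id, best_score = None, 0.0
    for entity_id, (name, dom) in candidates.items():
        if dom != domicile:
            continue
        score = SequenceMatcher(None, fund_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_id, best_score = entity_id, score
    if best_id is not None and best_score >= threshold:
        known[identifier] = best_id  # record the new link for next time
        return Resolution(best_id, best_score, "fuzzy")

    # 3. External enrichment: ask the outside source to confirm.
    confirmed = enrich(identifier)
    if confirmed is not None:
        known[identifier] = confirmed
        return Resolution(confirmed, 1.0, "external")

    # 4. Human review: return the best guess, but do not record it.
    return Resolution(best_id, best_score, "review")


# Usage with placeholder identifiers and a stub that never confirms anything.
known = {}
candidates = {"shareclass:1": ("Example Global Equity Fund", "LU")}
no_external = lambda identifier: None

first = resolve_identifier("ID-1", "Example Global Equity Fund", "LU", known, candidates, no_external)
second = resolve_identifier("ID-1", "", "", known, candidates, no_external)
third = resolve_identifier("ID-2", "Unrelated Name", "IE", known, candidates, no_external)
print(first.step, second.step, third.step)
```

The second call resolves on the exact-match step because the first call recorded the link, which is the "graph learns" behavior in miniature.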
The confidence question
Not all identifier links are equal. An ISIN confirmed by three independent providers and the LEI registry has near-perfect confidence. An internal code that appeared in one file from one provider six months ago? Maybe 0.6.
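How corroborating sources should combine into one score is a design choice that this post does not pin down. One common option, shown here purely as an assumption, is a noisy-OR: the link is treated as wrong only if every independent source is wrong. The function name and the example scores are hypothetical.

```python
def combine_confidence(scores):
    """Noisy-OR combination of per-source confidences.

    Assumes the sources are independent; correlated providers (for example,
    two vendors republishing the same upstream feed) would be overcounted.
    """
    all_wrong = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence must be in [0, 1], got {score}")
        all_wrong *= 1.0 - score
    return 1.0 - all_wrong


# Three independent sources at 0.9 each reinforce one another.
corroborated = combine_confidence([0.9, 0.9, 0.9])
# A single sighting stays where it started.
single = combine_confidence([0.6])
print(round(corroborated, 3), round(single, 3))
```

The independence assumption does the heavy lifting here, so it is worth checking whether two providers really are separate sources before letting them corroborate each other.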
The worst thing an identifier system can do is silently match the wrong entity. A 0.95 confidence match that's wrong corrupts every downstream calculation.
We set a hard threshold: below 0.85 confidence, the match goes to a review queue. At 0.85 or above, it proceeds but logs the resolution path. Every automated decision is auditable. You can ask "why did you link IE00BF4RFH31 to this fund?" and get a real answer, not just "it was in the spreadsheet."
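A minimal sketch of that routing rule follows. The function name, record fields, entity id, and resolution path are hypothetical, and treating a score of exactly 0.85 as passing is an assumption.

```python
REVIEW_THRESHOLD = 0.85


def route_match(identifier, entity_id, confidence, path, review_queue, audit_log):
    """Send low-confidence matches to review; log the path for everything else."""
    record = {
        "identifier": identifier,
        "entity_id": entity_id,
        "confidence": confidence,
        "path": list(path),  # the hops that justified the link
    }
    if confidence < REVIEW_THRESHOLD:
        review_queue.append(record)
        return "review"
    audit_log.append(record)
    return "auto"


# Usage: the path entries are illustrative edge descriptions.
review_queue, audit_log = [], []
high = route_match("IE00BF4RFH31", "fund:example", 0.97,
                   ["HAS_ISIN via provider_a", "SHARE_CLASS_OF via provider_b"],
                   review_queue, audit_log)
low = route_match("INTERNAL-42", "fund:example", 0.60,
                  ["HAS_INTERNAL_CODE via provider_c"],
                  review_queue, audit_log)
print(high, low)

# "Why did you link IE00BF4RFH31 to this fund?" is now a lookup, not a shrug.
answer = [r["path"] for r in audit_log if r["identifier"] == "IE00BF4RFH31"]
print(answer)
```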
Scale matters
A typical asset manager might have 500 funds with 2,000 share classes. Each share class has 5-10 identifiers. That's 10,000-20,000 identifier nodes with 20,000-40,000 edges. A graph database handles this without breaking a sweat.
But the real value shows at industry scale. When you're processing data from dozens of providers across thousands of funds, the graph becomes a shared resolution layer. Provider A's mapping helps resolve Provider B's data. The network effect compounds.
Lookup tables don't have network effects. They have maintenance costs.
We built Kairo's identifier layer as a graph from day one. It was more work upfront than a mapping table. Six months in, it's the single best architectural decision we made. Every new data source that connects makes the graph smarter. Every resolution path is traceable. And nobody maintains a spreadsheet.