Extracting structured data from fund documents
Somewhere right now, an operations analyst is manually keying SRRI risk ratings from a KIID PDF into a spreadsheet. They've done it hundreds of times. They'll get it right 98% of the time. That remaining 2% will cost someone real money.
Fund documents are a fascinating problem. They contain extremely precise, regulated data — NAVs, ongoing charges, risk indicators, benchmark names — wrapped in layouts that vary wildly between asset managers. One firm's factsheet puts the ISIN in the header. Another buries it in a footnote. A third splits it across two columns for reasons known only to God and their design agency.
The template trap
The traditional approach is template-based extraction. You build a parser for each document layout. You define coordinates, regex patterns, anchor text. It works beautifully — until the asset manager updates their template in Q2 and nobody tells you.
I've seen teams maintain 200+ templates. Each one is a little state machine that breaks in its own special way. A font change breaks OCR confidence. A new disclaimer paragraph shifts every coordinate down by 40 pixels. The maintenance burden is linear with the number of sources, and it never gets smaller.
Template extraction is precise but brittle. When it works, the accuracy is near-perfect. When it breaks, it breaks silently — returning the wrong cell value with full confidence.
LLMs change the economics
Large language models flip the problem. Instead of encoding layout rules, you describe what you want: "Extract the ISIN, base currency, ongoing charges figure, and SRRI rating from this KIID." The model handles layout variation implicitly.
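To make that concrete, here's a minimal sketch of the first pass. The field list, prompt wording, and model choice are all illustrative; any model that can return structured JSON will do the job.

```python
import json
from openai import OpenAI

client = OpenAI()  # one option among many; swap in your model client of choice

# Illustrative field list -- in practice this comes from your data model
KIID_FIELDS = {
    "isin": "12-character ISIN",
    "base_currency": "ISO 4217 currency code, e.g. EUR",
    "ongoing_charges_pct": "ongoing charges figure as a percentage, e.g. 0.45",
    "srri": "SRRI risk rating, integer 1-7",
}

PROMPT = """Extract the following fields from the KIID below.
Return JSON only. For each field return {{"value": ..., "confidence": 0-1}};
use null for anything you cannot find.

Fields: {fields}

Document text:
{text}
"""

def extract_first_pass(document_text: str) -> dict:
    """First pass: no templates, no coordinates -- just a field list and the document text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": PROMPT.format(
            fields=json.dumps(KIID_FIELDS, indent=2), text=document_text)}],
    )
    return json.loads(response.choices[0].message.content)
```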
We tested this extensively. On a corpus of 500+ KIIDs across 30 asset managers, an LLM-based extractor achieved 96% field-level accuracy out of the box. No templates. No coordinates. Just a prompt and a PDF.
That sounds great until you think about what 4% error means at scale. If you're processing 10,000 documents a month, that's 400 fields with potential errors. In fund data, an ongoing charges figure extracted as 1.45% instead of 0.145% — a single misplaced decimal point — is a regulatory problem.
The hybrid that actually works
What we've landed on is a layered approach (there's a rough code sketch after the list):
- LLM extraction first pass — handles layout variation, returns structured fields with per-field confidence scores
- Validation rules second pass — ISIN checksum validation, currency code lookup, numeric range checks (an ongoing charge of 14.5% is almost certainly wrong)
- Cross-reference third pass — compare extracted values against existing data. If we already have the NAV for LU0599946893 from a data feed, and the PDF extraction disagrees by more than 1%, flag it
- Human review for exceptions only — not every document, just the ones where confidence is below threshold or validation fails
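Stitched together, the routing logic is not much more complicated than the sketch below. It assumes the per-field output shape from the first-pass sketch above; the thresholds, field names, and reference NAV lookup are illustrative, and real validation rules run much deeper.

```python
VALID_CURRENCIES = {"EUR", "USD", "GBP", "CHF", "JPY"}  # in practice, the full ISO 4217 list

def needs_human_review(extracted: dict, reference_nav: float | None = None,
                       confidence_threshold: float = 0.9) -> list[str]:
    """Return the reasons a document should go to a reviewer; an empty list means straight through."""
    reasons = []

    # Pass 1 output: per-field value + confidence from the LLM
    for field, result in extracted.items():
        if result["value"] is None or result["confidence"] < confidence_threshold:
            reasons.append(f"low confidence on {field}")

    # Pass 2: validation rules
    if extracted["base_currency"]["value"] not in VALID_CURRENCIES:
        reasons.append("unknown currency code")
    ocf = extracted["ongoing_charges_pct"]["value"]
    if ocf is not None and not 0 <= ocf < 5:   # 14.5% is almost certainly a decimal slip
        reasons.append("ongoing charges out of range")

    # Pass 3: cross-reference against data we already hold
    nav = extracted.get("nav", {}).get("value")
    if reference_nav and nav and abs(nav - reference_nav) / reference_nav > 0.01:
        reasons.append("NAV disagrees with the feed by more than 1%")

    return reasons
```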
The key insight: extraction accuracy matters less than detection accuracy. You don't need the LLM to be perfect. You need to know when it's wrong.
The fields that trip everyone up
Some fields are easy. ISINs have a checksum. Currency codes come from a fixed list. NAVs are unambiguous numbers.
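The ISIN check digit, for example, is just the Luhn algorithm run over a letters-to-numbers expansion, so a misread or transposed character is caught almost for free:

```python
def isin_checksum_ok(isin: str) -> bool:
    """Validate an ISIN's check digit (letters map to 10-35, then a standard Luhn check)."""
    if len(isin) != 12 or not isin[:2].isalpha():
        return False
    # Expand letters: A=10 ... Z=35; digits stay as-is
    expanded = "".join(str(ord(c) - 55) if c.isalpha() else c for c in isin.upper())
    if not expanded.isdigit():
        return False
    total = 0
    for i, digit in enumerate(reversed(expanded)):
        n = int(digit)
        if i % 2 == 1:          # double every second digit from the right
            n = n * 2 - 9 if n * 2 > 9 else n * 2
        total += n
    return total % 10 == 0

assert isin_checksum_ok("LU0599946893")      # the fund from the cross-reference example
assert not isin_checksum_ok("LU0599946894")  # one digit off fails the checksum
```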
The hard ones are semi-structured text fields. Benchmark names are a nightmare — is it "MSCI World Index" or "MSCI World (Net Return)" or "MSCI World NR USD"? These are all the same benchmark, but string matching says they're three different things.
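A crude but effective mitigation is to canonicalise names before comparing them. The synonym table below is illustrative; in practice it grows with every new factsheet, and some names still need an explicit alias entry.

```python
import re

# Tokens that mean the same thing across providers -- grows over time
SYNONYMS = {
    "nr": "net return",
    "tr": "total return",
    "index": "",
    "usd": "", "eur": "",   # currency suffixes rarely disambiguate the benchmark itself
}

def canonical_benchmark(name: str) -> str:
    """Lowercase, strip punctuation, apply the synonym table, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", name.lower())
    for token, replacement in SYNONYMS.items():
        text = re.sub(rf"\b{token}\b", replacement, text)
    return " ".join(text.split())

names = ["MSCI World (Net Return)", "MSCI World NR USD", "MSCI World Index"]
print([canonical_benchmark(n) for n in names])
# ['msci world net return', 'msci world net return', 'msci world']
# The first two now match; the third still needs an explicit alias entry
# because it omits the return qualifier entirely.
```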
Performance figures require understanding context: is this a cumulative return or annualised? Calendar year or rolling? Gross or net of fees? The number 12.4% means completely different things depending on these qualifiers.
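One practical consequence is that a bare number is a poor extraction target; we ask for the qualifiers alongside it. Roughly this shape (the field names are ours, not a standard):

```python
from dataclasses import dataclass
from enum import Enum

class Basis(Enum):
    CUMULATIVE = "cumulative"
    ANNUALISED = "annualised"

class Fees(Enum):
    GROSS = "gross"
    NET = "net"

@dataclass
class PerformanceFigure:
    value_pct: float    # 12.4 on its own tells you almost nothing
    basis: Basis        # cumulative vs annualised
    fees: Fees          # gross vs net of fees
    period: str         # e.g. "2023 calendar year" or "3y rolling"
```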
The document always contains the answer. The problem is that it assumes a human reader who understands the context that surrounds the number.
What we've learned
After processing tens of thousands of fund documents, a few hard-won lessons:
Don't try to extract everything. Start with the 15-20 fields that matter for your use case. Expanding scope is easy once the pipeline is proven.
Version your extraction prompts like code. When you tune a prompt to handle a new edge case, you need to know it didn't break three old ones. We run regression tests against a golden set of documents with known correct values.
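The regression test itself doesn't need to be sophisticated. Something like the sketch below, reusing the first-pass extractor from earlier; the file layout and the accuracy floor are illustrative.

```python
import json
from pathlib import Path

GOLDEN_DIR = Path("tests/golden")   # one .json of known-correct fields per document, .txt alongside

def test_prompt_regression():
    """Re-run extraction over the golden set and fail if field-level accuracy drops."""
    correct = total = 0
    for expected_file in GOLDEN_DIR.glob("*.json"):
        expected = json.loads(expected_file.read_text())
        document_text = expected_file.with_suffix(".txt").read_text()
        extracted = extract_first_pass(document_text)
        for field, truth in expected.items():
            total += 1
            if extracted.get(field, {}).get("value") == truth:
                correct += 1
    assert correct / total >= 0.96, f"accuracy regressed to {correct / total:.1%}"
```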
PDFs are not all born equal. Digitally-generated PDFs give you clean text layers. Scanned documents need OCR first, which adds its own error rate. Know which you're dealing with before you choose your approach.
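A cheap way to tell the two apart before choosing a pipeline is to check whether the text layer actually yields characters. The 100-characters-per-page threshold below is a heuristic, not a standard.

```python
from pypdf import PdfReader

def has_text_layer(pdf_path: str, min_chars_per_page: int = 100) -> bool:
    """Heuristic: a digitally-generated PDF yields real text; a scan yields (almost) nothing."""
    reader = PdfReader(pdf_path)
    pages = list(reader.pages)[:3]   # the first few pages are enough to decide
    chars = sum(len(page.extract_text() or "") for page in pages)
    return chars / max(len(pages), 1) >= min_chars_per_page

# Route accordingly: text layer -> straight to extraction, otherwise OCR first
```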
The end game isn't replacing humans. It's giving them a pre-filled, validated, confidence-scored result where they spend 10 seconds confirming instead of 10 minutes typing. At 5,000 documents a month, that's the difference between a team of eight and a team of two.