Web scraping as a data source for fund data
Not every data provider gives you a nice SFTP drop or a REST API. Some of them publish fund factsheets as PDFs on their website and call it a day. When that's your only source for NAV data on 200 share classes, you scrape.
I've seen teams spend months negotiating API access with a transfer agent, only to discover the "API" is a manually-triggered CSV export emailed to a shared inbox. At some point you accept reality: the data is on a web page, and you need it in your pipeline.
That doesn't mean you do it recklessly.
The legal question you can't ignore
First things first: scraping publicly available data is not inherently illegal, but it's not a free-for-all either. The hiQ v. LinkedIn ruling in the US held that scraping publicly accessible data doesn't violate the CFAA's "without authorization" provision. But you're in fund data, which means you're probably dealing with European providers and GDPR territory.
The practical rules we follow:
- Only scrape data that's publicly accessible without authentication
- Respect robots.txt directives
- Never scrape personal data - fund-level data (ISINs, NAVs, TERs) is not personal data
- Rate limit aggressively - you're not trying to DDoS a fund admin's website
- Cache everything - don't re-scrape what hasn't changed
If a provider explicitly prohibits scraping in their terms of service, we respect that and find another path. Usually that means picking up the phone and having the commercial conversation.
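Checking robots.txt before each run is cheap to automate with the standard library. A minimal sketch - the robots.txt body and the user-agent name are invented for illustration; in practice you'd fetch the file from the provider's domain:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice, fetch it from
# https://<provider-domain>/robots.txt before the first request.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a reusable permission checker."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = build_robot_parser(ROBOTS_TXT)
print(rp.can_fetch("kairo-scraper", "/funds/LU0123456789"))  # True
print(rp.can_fetch("kairo-scraper", "/admin/export"))        # False
```

The parser also exposes any Crawl-delay directive, which is a sensible floor for your own rate limit.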
Rate limiting is non-negotiable
A fund house's website is not built to handle a bot hammering it for 3,000 share class pages in 30 seconds. We learned this the hard way early on - a scraper that ran too fast triggered a WAF ban, and suddenly our client's legitimate browser access to the provider's portal was blocked too. Same IP range.
Our scrapers enforce a minimum delay of 2000ms between requests by default, with exponential backoff on any non-200 response. For larger runs, we spread requests across a 4-hour window overnight. Nobody notices. Nobody cares.
The goal is to be invisible, not fast. A scraper that gets blocked is worse than no scraper at all.
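The delay-plus-backoff policy above can be sketched as a small wrapper. This is illustrative, not our production code: the `fetch` callable and class name are assumptions, and the transport is injected so the policy is testable without touching a real website:

```python
import random
import time

class PoliteFetcher:
    """Enforces a minimum inter-request delay and exponential backoff
    on non-200 responses. `fetch` is any callable returning
    (status_code, body) - injected so the policy runs offline in tests."""

    def __init__(self, fetch, min_delay=2.0, max_retries=4, sleep=time.sleep):
        self.fetch = fetch
        self.min_delay = min_delay
        self.max_retries = max_retries
        self.sleep = sleep          # injectable for tests
        self._last_request = 0.0

    def get(self, url):
        for attempt in range(self.max_retries):
            # Respect the minimum gap since the previous request.
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_delay:
                self.sleep(self.min_delay - elapsed)
            self._last_request = time.monotonic()

            status, body = self.fetch(url)
            if status == 200:
                return body
            # Non-200: back off 2s, 4s, 8s... with jitter so retries
            # from parallel scrapers don't align.
            self.sleep(self.min_delay * 2 ** attempt + random.uniform(0, 0.5))
        raise RuntimeError(f"giving up on {url} after {self.max_retries} tries")
```

Spreading a large run across an overnight window is then just a matter of raising `min_delay` until total runtime fills the window.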
Change detection saves everything
The most expensive part of scraping isn't the HTTP requests. It's what happens when the page structure changes and your parser silently starts returning garbage.
We've seen a provider redesign their fund page and suddenly ISIN: LU0123456789 was being parsed as the page title. The NAV field picked up a copyright notice. Everything downstream - validation, mapping, delivery to Kurtosys - was polluted before anyone noticed.
Now every scraper has a structural fingerprint. We hash the DOM skeleton (tag hierarchy, class names, key selectors) and compare it before parsing. If the fingerprint changes, the scraper pauses and raises an alert. A human reviews the new structure, updates the selectors, and we resume.
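The fingerprint idea is simple enough to sketch with the standard library alone. This collects only tag names and class attributes - the skeleton - and ignores text content, which legitimately changes every day; the class name and sample pages are invented for illustration:

```python
import hashlib
from html.parser import HTMLParser

class SkeletonHasher(HTMLParser):
    """Records tag names and class attributes only, ignoring text,
    so daily data changes don't alter the fingerprint."""
    def __init__(self):
        super().__init__()
        self.skeleton = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        self.skeleton.append(f"{tag}.{classes}")

def structural_fingerprint(html: str) -> str:
    parser = SkeletonHasher()
    parser.feed(html)
    return hashlib.sha256("|".join(parser.skeleton).encode()).hexdigest()

page_v1 = '<div class="nav"><span class="isin">LU0123456789</span></div>'
page_v1_new_nav = '<div class="nav"><span class="isin">LU0000000000</span></div>'
page_v2 = '<div class="fund-nav"><h1>Fund</h1></div>'

# Same structure, different data: fingerprints match.
assert structural_fingerprint(page_v1) == structural_fingerprint(page_v1_new_nav)
# Redesigned page: fingerprint changes, scraper should pause and alert.
assert structural_fingerprint(page_v1) != structural_fingerprint(page_v2)
```

Store the last-approved fingerprint alongside the selectors; the comparison is one hash equality check per run.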
What we actually scrape in practice
The most common scraping targets in fund data:
- Fund factsheet PDFs - monthly performance, top holdings, asset allocation. These are PDFs linked from a listing page. You scrape the page to get the PDF URLs, then parse the PDFs separately.
- NAV tables - some smaller fund houses publish daily NAVs in a single HTML table. One page, 50 share classes, updated by 10am. Beautiful.
- KID/PRIIP documents - regulatory documents with structured risk indicators and cost data. Usually linked from a document library page.
- Fund listings - the master list of what funds exist, their ISINs, launch dates, and base currencies. This is often the only machine-readable source for a provider's fund universe.
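For the simplest of these targets - a plain NAV table - the standard library's parser is enough. A minimal sketch, assuming a two-column ISIN/NAV layout (real pages need per-provider selectors, and the sample markup is invented):

```python
from html.parser import HTMLParser

class NavTableParser(HTMLParser):
    """Extracts (ISIN, NAV) tuples from a simple two-column HTML table.
    The column layout is an assumption for illustration."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.current_row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
        elif tag == "tr":
            self.current_row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current_row:
            self.rows.append(tuple(self.current_row))

    def handle_data(self, data):
        if self.in_cell:
            self.current_row.append(data.strip())

SAMPLE_TABLE = """
<table>
<tr><td>LU0123456789</td><td>104.23</td></tr>
<tr><td>LU0987654321</td><td>98.10</td></tr>
</table>
"""

p = NavTableParser()
p.feed(SAMPLE_TABLE)
print(p.rows)  # [('LU0123456789', '104.23'), ('LU0987654321', '98.10')]
```

PDF factsheets and KID documents need a separate parsing stage; the scrape only collects their URLs.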
The pipeline integration
A scraper is just another ingest source. In Kairo, it produces the same normalised intermediate format as an SFTP file drop or an API pull. The downstream pipeline doesn't know or care where the data came from. It sees rows with ISINs, field values, and a source label like provider-web-scrape.
This is important because scraping is inherently less reliable than a structured feed. The data still goes through the same validation rules, the same confidence scoring, the same conflict resolution. If a scraped NAV disagrees with a NAV from a direct feed, the direct feed wins. Always.
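The "direct feed wins" rule falls out naturally when every normalised row carries a source label and sources are ranked. A minimal sketch - the source names and ranking are illustrative, not Kairo's actual configuration:

```python
from dataclasses import dataclass

# Hypothetical source ranking: lower number = more trusted.
SOURCE_PRIORITY = {
    "provider-direct-feed": 0,
    "provider-api": 1,
    "provider-web-scrape": 2,
}

@dataclass
class FieldValue:
    isin: str
    field: str
    value: str
    source: str

def resolve(candidates):
    """When sources disagree on the same field, the most trusted wins."""
    return min(candidates, key=lambda c: SOURCE_PRIORITY[c.source])

scraped = FieldValue("LU0123456789", "nav", "104.25", "provider-web-scrape")
direct = FieldValue("LU0123456789", "nav", "104.23", "provider-direct-feed")

winner = resolve([scraped, direct])
print(winner.value)  # 104.23 - the direct feed wins
```

Because the ranking lives in one place, promoting a scrape to a trusted source (or demoting a flaky feed) is a config change, not a code change.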
Scraping is a tool of last resort that we've made reliable enough to be a tool of first resort for certain providers. The trick is treating it with the same engineering rigour as any other data source - not as a hacky script someone wrote on a Friday afternoon and forgot about.
Because that Friday afternoon script? It's the one that breaks at 3am on a Monday when the client needs their data by 7am. Every time.