Error handling in data pipelines at scale
Most data pipelines have two error states: success and failure. That's not enough. When you're processing hundreds of files daily across dozens of sources, you need three layers of error granularity — and the layer you handle errors at determines whether your ops team can actually triage problems or just stare at a wall of red alerts.
The binary error problem
Here's what happens with naive error handling. A pipeline processes a CSV with 500 rows of NAV data. Row 347 has an invalid date format — 03/15/2026 instead of 2026-03-15. The pipeline throws an exception. The entire file fails. An ops analyst gets an alert that says "Pipeline failed: invalid date format." They open the file, scroll to row 347, fix the date, re-upload, and wait for the pipeline to run again.
The other 499 rows were perfectly valid. They're now delayed because of one bad date in one row.
This is how most data pipelines work. It's also why ops teams hate their data pipelines.
Layer 1: File-level errors
Some errors genuinely affect the entire file. These should fail the whole pipeline — but with specific, actionable messages:
- Schema mismatch. The file has 12 columns but the mapping expects 15. Someone changed the export template. This needs human investigation — you can't process partial schemas safely.
- Encoding errors. The file claims to be UTF-8 but contains Windows-1252 characters. Every value is potentially corrupted. Reject the whole thing.
- Missing identifiers. The file has no ISIN column, no fund name column, nothing to join on. There's no way to match these rows to known entities. Full stop.
- Zero rows. The file arrived empty, or the header row is present but there's no data. This usually means the provider's export broke. Alert and wait for a re-send.
File-level errors should be rare and loud. When they happen, they mean something is structurally wrong with the source, and the right response is usually to contact the data provider.
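The four checks above can be sketched as a single gate that runs before any row processing. This is a minimal illustration, not a production validator — the column count, identifier names, and `FileLevelError` type are all hypothetical stand-ins for per-provider configuration:

```python
import csv

# Hypothetical expectations for one source; real configs live per-provider.
EXPECTED_COLUMNS = 15
IDENTIFIER_COLUMNS = {"ISIN", "Fund Name"}

class FileLevelError(Exception):
    """Structural problem: reject the whole file with an actionable message."""

def check_file(path):
    # Opening as UTF-8 means mis-encoded bytes raise UnicodeDecodeError
    # while reading — a file-level rejection, not a row-level one.
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            raise FileLevelError(f"{path}: file is empty — request a re-send")
        if len(header) != EXPECTED_COLUMNS:
            raise FileLevelError(
                f"{path}: schema mismatch — got {len(header)} columns, "
                f"expected {EXPECTED_COLUMNS}; export template may have changed"
            )
        if not IDENTIFIER_COLUMNS & set(header):
            raise FileLevelError(
                f"{path}: no identifier column "
                f"({', '.join(sorted(IDENTIFIER_COLUMNS))}) — nothing to join on"
            )
        rows = list(reader)
        if not rows:
            raise FileLevelError(f"{path}: header present but zero data rows")
        return header, rows
```

Note that each message names the file and the specific structural problem, which is what makes "rare and loud" actionable rather than just loud.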
Layer 2: Column-level errors
This is the layer most pipelines miss entirely, and it's arguably the most important for operational efficiency.
A column-level error means: this field exists and has data, but the data doesn't match what we expect for this field type across the entire file. Examples:
- The "Currency" column contains `EURO` instead of `EUR` — every row has the same problem because the source is using a non-standard code
- The "NAV Date" column uses MM/DD/YYYY format instead of the expected ISO 8601 — systematic, not a one-off typo
- A new column appeared that doesn't exist in the mapping configuration — the provider added a field without telling you (this happens constantly)
Column-level errors are distinct from row-level errors because they're systematic. The fix is usually a mapping adjustment or a transformation rule, not a data correction. Surfacing them separately means your ops team can say "ah, they changed the currency code format" and update a single rule, rather than investigating each row individually.
The difference between a good data platform and a bad one is often just this: does it tell you the problem is systematic or isolated?
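One way to detect "systematic vs. isolated" is to validate each column against its format rule across the whole file and only flag the column when nearly every value fails. This is a sketch under assumptions — the regex rules, the 90% threshold, and the rule table are illustrative, not a prescribed design:

```python
import re

# Hypothetical per-column format rules for one source.
COLUMN_RULES = {
    "Currency": re.compile(r"^[A-Z]{3}$").match,           # ISO 4217 alpha-3
    "NAV Date": re.compile(r"^\d{4}-\d{2}-\d{2}$").match,  # ISO 8601 date
}

def column_level_issues(header, rows, threshold=0.9):
    """Flag a column as systematic when nearly every value fails its rule."""
    issues = []
    # A column we have no mapping for means the provider added a field.
    for col in header:
        if col not in COLUMN_RULES:
            issues.append((col, "unmapped column — provider may have added a field"))
    for col, rule in COLUMN_RULES.items():
        if col not in header:
            continue
        i = header.index(col)
        values = [r[i] for r in rows if len(r) > i]
        bad = sum(1 for v in values if not rule(v))
        if values and bad / len(values) >= threshold:
            issues.append((col, f"{bad}/{len(values)} values fail format rule — "
                                "likely a source-side format change"))
    return issues
```

The threshold is the point of the design: a column where 95% of values fail is a mapping problem to fix once, while a column where 1% fail is row-level work and falls through to the next layer.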
Layer 3: Row-level errors
Row-level errors are the most granular: a specific value in a specific row failed a specific validation. The key design decision here is what to do with the rest of the file.
In Kairo, we process valid rows and quarantine invalid ones. Row 347 has a bad date? The other 499 rows flow through the pipeline normally. Row 347 gets flagged with the specific validation failure, the field that failed, and the value that caused it. An analyst can review quarantined rows, fix them, and re-process just those rows.
This sounds obvious, but implementing it properly requires careful thought:
- Partial success must be visible. If 499 of 500 rows succeeded, the pipeline status should show "499/500 — 1 quarantined" not just "success." The analyst needs to know there's something to review.
- Quarantined rows must be re-processable. Not "download the original file, fix it, re-upload everything." Just the failed rows, through the same pipeline, with the same rules.
- Error context must be specific. "Validation failed" is useless. "Row 347, column 'NAV_Date', value '03/15/2026': expected ISO 8601 date format (YYYY-MM-DD)" is actionable.
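The three requirements above can be sketched together: route each row to a valid set or a quarantine set, record the row number, column, value, and reason for every quarantined row, and report partial success in the status. The `NAV_Date` column name and `Quarantined` record are hypothetical, not Kairo's actual schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Quarantined:
    row_number: int  # numbered as in the source file, so an analyst can find it
    column: str
    value: str
    reason: str

def process_rows(header, rows):
    """Route each row to valid or quarantine instead of failing the file."""
    date_idx = header.index("NAV_Date")  # hypothetical column name
    valid, quarantine = [], []
    for n, row in enumerate(rows, start=2):  # row 1 is the header
        try:
            date.fromisoformat(row[date_idx])
        except ValueError:
            quarantine.append(Quarantined(
                n, "NAV_Date", row[date_idx],
                "expected ISO 8601 date format (YYYY-MM-DD)",
            ))
            continue
        valid.append(row)
    # Partial success stays visible: "499/500 — 1 quarantined", never bare "success".
    status = f"{len(valid)}/{len(rows)}"
    if quarantine:
        status += f" — {len(quarantine)} quarantined"
    return valid, quarantine, status
```

Because each `Quarantined` record carries everything needed to re-validate it, re-processing just the failed rows is a matter of feeding them back through the same function rather than re-uploading the file.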
The triage hierarchy
With three layers, your morning triage workflow becomes structured:
- Check file-level failures first. These are the fires. A file didn't arrive, or arrived corrupted. Contact the provider. This takes 5 minutes.
- Review column-level issues. A provider changed their format. Update the mapping rule. This takes 10 minutes and fixes the problem permanently.
- Handle row-level exceptions. Individual data quality issues. Review quarantined rows, fix or escalate. This is the steady-state work.
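The hierarchy is trivially enforceable in the alert surface itself: tag every alert with its layer and sort fires above routine work. A minimal sketch, assuming alerts are dicts with a hypothetical `layer` field:

```python
# Hypothetical severity ordering: file-level fires first, steady-state row work last.
LAYER_PRIORITY = {"file": 0, "column": 1, "row": 2}

def triage(alerts):
    """Order a flat alert list so structural fires surface before routine issues."""
    return sorted(alerts, key=lambda a: LAYER_PRIORITY[a["layer"]])
```

A stable sort like Python's preserves arrival order within each layer, so the analyst still sees the oldest fire first.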
Without this hierarchy, everything is a flat list of errors. An ops analyst looking at 47 alerts can't tell which ones are fires and which ones are routine. They either panic about everything or — more dangerously — stop checking the alerts entirely because the noise level is too high.
What we learned from running hundreds of pipelines
After processing hundreds of data feeds, some patterns emerged:
80% of file-level failures are provider-side. Changed schemas, late deliveries, encoding issues. Your error messages for these should be written so an analyst can paste them straight into an email to the provider.
Column-level issues cluster around format changes. When they happen, they affect every file from that source going forward. Detecting them early — ideally on the first file after the change — saves days of downstream cleanup.
Row-level errors follow a power law. A small number of funds cause the majority of data quality issues. Knowing which funds are problematic lets you apply stricter validation where it matters and lighter validation where the data is consistently clean.
Error handling isn't glamorous. Nobody puts it on a conference slide. But the difference between an ops team that's in control and one that's drowning usually comes down to whether their error handling gives them signal or noise.