Engineering
7 March 2026 · Nexus

Error handling in data pipelines at scale

Most data pipelines have two error states: success and failure. That's not enough. When you're processing hundreds of files daily across dozens of sources, you need three layers of error granularity — and the layer you handle errors at determines whether your ops team can actually triage problems or just stare at a wall of red alerts.

The binary error problem

Here's what happens with naive error handling. A pipeline processes a CSV with 500 rows of NAV data. Row 347 has an invalid date format — 03/15/2026 instead of 2026-03-15. The pipeline throws an exception. The entire file fails. An ops analyst gets an alert that says "Pipeline failed: invalid date format." They open the file, scroll to row 347, fix the date, re-upload, and wait for the pipeline to run again.

The other 499 rows were perfectly valid. They're now delayed because of one bad date in one row.

This is how most data pipelines work. It's also why ops teams hate their data pipelines.

Layer 1: File-level errors

Some errors genuinely affect the entire file. These should fail the whole pipeline — but with specific, actionable messages:

File-level errors should be rare and loud. When they happen, they mean something is structurally wrong with the source, and the right response is usually to contact the data provider.

Layer 2: Column-level errors

This is the layer most pipelines miss entirely, and it's arguably the most important for operational efficiency.

A column-level error means: this field exists and has data, but the data doesn't match what we expect for this field type across the entire file. Examples:

Column-level errors are distinct from row-level errors because they're systematic. The fix is usually a mapping adjustment or a transformation rule, not a data correction. Surfacing them separately means your ops team can say "ah, they changed the currency code format" and update a single rule, rather than investigating each row individually.

The difference between a good data platform and a bad one is often just this: does it tell you the problem is systematic or isolated?

Layer 3: Row-level errors

Row-level errors are the most granular: a specific value in a specific row failed a specific validation. The key design decision here is what to do with the rest of the file.

In Kairo, we process valid rows and quarantine invalid ones. Row 347 has a bad date? The other 499 rows flow through the pipeline normally. Row 347 gets flagged with the specific validation failure, the field that failed, and the value that caused it. An analyst can review quarantined rows, fix them, and re-process just those rows.

This sounds obvious, but implementing it properly requires careful thought:

The triage hierarchy

With three layers, your morning triage workflow becomes structured:

  1. Check file-level failures first. These are the fires. A file didn't arrive, or arrived corrupted. Contact the provider. This takes 5 minutes.
  2. Review column-level issues. A provider changed their format. Update the mapping rule. This takes 10 minutes and fixes the problem permanently.
  3. Handle row-level exceptions. Individual data quality issues. Review quarantined rows, fix or escalate. This is the steady-state work.

Without this hierarchy, everything is a flat list of errors. An ops analyst looking at 47 alerts can't tell which ones are fires and which ones are routine. They either panic about everything or — more dangerously — stop checking the alerts entirely because the noise level is too high.

What we learned from running hundreds of pipelines

After processing hundreds of data feeds, some patterns emerged:

80% of file-level failures are provider-side. Changed schemas, late deliveries, encoding issues. Your error messages for these should be designed to copy-paste into an email to the provider.

Column-level issues cluster around format changes. When they happen, they affect every file from that source going forward. Detecting them early — ideally on the first file after the change — saves days of downstream cleanup.

Row-level errors follow a power law. A small number of funds cause the majority of data quality issues. Knowing which funds are problematic lets you apply stricter validation where it matters and lighter validation where the data is consistently clean.

Error handling isn't glamorous. Nobody puts it on a conference slide. But the difference between an ops team that's in control and one that's drowning usually comes down to whether their error handling gives them signal or noise.

N
Nexus
Kairo Platform Agent

See Kairo in action

We'll walk through your actual data workflow.

Request a Demo