Architecture
18 March 2026 · Nexus

Why an event spine, not REST calls

The first version of every platform is a web of REST calls. Service A calls Service B, which calls Service C. It works fine with three services. By the time you have eight, you've built a distributed monolith where a single slow endpoint cascades failures across the entire system.

I've lived through this. At a previous enterprise platform, we had a fund data pipeline where the ingest service called the validation service via HTTP, which called the enrichment service, which called the storage service. Change a field in the validation response schema and three services break. Deploy the enrichment service with a bug and the entire pipeline backs up because ingest is waiting on a synchronous HTTP response that never comes.

Events change the conversation

With an event-driven architecture, services don't call each other. They emit events. The ingest service doesn't call validation — it publishes a file.ingested event. The validation service subscribes to that event, does its work, and emits file.validated. The enrichment service picks that up. And so on.

The difference is subtle but fundamental: no service knows who's listening. Ingest doesn't know validation exists. It just says "here's what happened" and moves on. This means you can add a new consumer — say, an audit logger or a metrics collector — without touching the producer at all.

In fund data, this matters more than in most domains. A single file upload might need to trigger mapping, validation, enrichment, identifier resolution, quality scoring, and delivery — potentially in parallel. Orchestrating that through REST calls means the ingest service needs to know about all of those downstream steps. With events, ingest is blissfully ignorant.
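To make the chain concrete, here is a minimal sketch using Redis Streams via redis-py (the transport this post settles on below). The stream name events:pipeline and the file ID are illustrative, not the platform's actual names.

```python
import json
import redis

r = redis.Redis(decode_responses=True)

def publish(stream: str, event_type: str, payload: dict) -> None:
    # The producer appends a fact to the stream; it never knows who reads it.
    r.xadd(stream, {"type": event_type, "payload": json.dumps(payload)})

# Ingest announces what happened and moves on.
publish("events:pipeline", "file.ingested", {"fileId": "f-123"})

# Validation subscribes, does its work, and emits the next fact.
last_id = "0"
while True:
    for _stream, entries in r.xread({"events:pipeline": last_id}, block=5000) or []:
        for entry_id, fields in entries:
            last_id = entry_id
            if fields["type"] == "file.ingested":
                # ... run validation checks here ...
                publish("events:pipeline", "file.validated",
                        json.loads(fields["payload"]))
```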

Why Redis Streams, not Kafka

This is where people expect me to say "Kafka." I didn't choose Kafka. I chose Redis Streams.

Kafka is a phenomenal piece of infrastructure. It's also operationally heavy. You need ZooKeeper (or KRaft). You need to think about partition strategies, consumer group rebalancing, retention policies, and broker sizing. For a platform processing millions of events per second, that overhead is justified.

We're not processing millions of events per second. We're processing fund data. A busy day is maybe 50,000 events. A very busy day is 200,000. Redis Streams handles this without breaking a sweat, and it's infrastructure we already run for caching and pub/sub.

The practical advantages:

- It's infrastructure we already run for caching and pub/sub, so there's no new cluster to operate and no ZooKeeper to babysit.
- Consumer groups give us load-balanced consumption and explicit acknowledgements without partition-rebalancing ceremony.
- Entry IDs embed millisecond timestamps, so replaying a window of events is a single range read.
- The pending-entries list tracks delivery counts, which is the hook for retries and dead letter streams (both covered below).

Choose boring technology for your message bus. The exciting part of your system should be what you do with the data, not how you move it between services.
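Here is a sketch of what that boring consumption side looks like, assuming the illustrative events:pipeline stream from above. Note that adding the audit group requires no change to any producer, which is exactly the property described earlier.

```python
import redis

r = redis.Redis(decode_responses=True)

# Each group gets its own independent cursor over the same stream.
for group in ("validation", "audit"):
    try:
        r.xgroup_create("events:pipeline", group, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists

# One validation worker: read new entries, process, acknowledge.
while True:
    resp = r.xreadgroup("validation", "worker-1",
                        {"events:pipeline": ">"}, count=10, block=5000)
    for _stream, entries in resp or []:
        for entry_id, fields in entries:
            # ... handle fields["type"] ...
            r.xack("events:pipeline", "validation", entry_id)
```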

The event schema contract

Events without schemas are just chaos with a message broker. Every event in our system follows a strict contract:

Each event carries a type (e.g., pipeline.file.mapped), a timestamp, a source service identifier, a correlationId that threads through the entire pipeline, and a payload with the domain-specific data.
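As a sketch, that envelope might look like this in Python. The field names follow the post; the dataclass and helper are mine, not the platform's actual types.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class Event:
    type: str           # e.g. "pipeline.file.mapped"
    source: str         # emitting service identifier
    correlationId: str  # threads through the entire pipeline
    payload: dict       # domain-specific data
    timestamp: float = field(default_factory=time.time)

def new_pipeline_event(event_type: str, source: str, payload: dict,
                       correlation_id: str | None = None) -> Event:
    # A file entering the system mints a correlation ID;
    # every downstream event reuses it.
    return Event(type=event_type, source=source,
                 correlationId=correlation_id or str(uuid.uuid4()),
                 payload=payload)

evt = new_pipeline_event("pipeline.file.mapped", "mapping-service",
                         {"fileId": "f-123"})
wire = json.dumps(asdict(evt))  # what actually lands on the stream
```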

The correlation ID is critical. When a file enters the system, it gets a correlation ID that follows it through ingest, mapping, validation, enrichment, and delivery. When something goes wrong at step 4, you can trace the entire journey. Without it, debugging distributed systems is archaeology.

Patterns that emerged

Dead letter streams. When a consumer fails to process an event after three retries, it goes to a dead letter stream. An operator can inspect it, fix the issue, and replay it. This is dramatically easier to manage than HTTP retry logic scattered across six services.
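A sketch of that sweep, using the illustrative names from above. Redis already tracks delivery counts in each group's pending-entries list, so "three retries" is just a threshold check:

```python
import redis

r = redis.Redis(decode_responses=True)
STREAM, GROUP, DLQ, MAX_DELIVERIES = "events:pipeline", "validation", "events:dlq", 3

def sweep_dead_letters() -> None:
    # Inspect entries that were delivered but never acknowledged.
    for entry in r.xpending_range(STREAM, GROUP, "-", "+", count=100):
        if entry["times_delivered"] >= MAX_DELIVERIES:
            msgs = r.xrange(STREAM, min=entry["message_id"], max=entry["message_id"])
            if msgs:
                _entry_id, fields = msgs[0]
                r.xadd(DLQ, fields)  # park it for an operator to inspect and replay
            r.xack(STREAM, GROUP, entry["message_id"])  # stop redelivery
```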

Event sourcing for audit. In fund data, you need to answer "what happened to this file and when?" The event stream is the audit log. You don't need to build a separate audit system — you already have one.
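Sketched with the same assumed names, and assuming the correlationId is stored as a flat field on each stream entry, reconstructing a file's journey is one range read:

```python
import redis

r = redis.Redis(decode_responses=True)

def history(correlation_id: str) -> list[tuple[int, str]]:
    """Every (millisecond timestamp, event type) in a file's journey."""
    out = []
    for entry_id, fields in r.xrange("events:pipeline", "-", "+"):
        if fields.get("correlationId") == correlation_id:
            out.append((int(entry_id.split("-")[0]), fields["type"]))
    return out
```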

Selective replay. When we deploy a fix to the enrichment service, we can replay events from the last 24 hours through it. The corrected data flows downstream automatically. Try doing that with REST calls.
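One way to sketch this, given that Redis stream IDs embed a millisecond timestamp: range-read the last 24 hours and republish the relevant events to a stream the fixed service consumes. The replay stream name is illustrative, and the platform's actual replay mechanism isn't shown here.

```python
import time
import redis

r = redis.Redis(decode_responses=True)

def replay_last_24h() -> int:
    # Stream IDs are "<ms>-<seq>", so a timestamp converts directly to a start ID.
    start = f"{int((time.time() - 24 * 3600) * 1000)}-0"
    count = 0
    for _entry_id, fields in r.xrange("events:pipeline", min=start, max="+"):
        if fields.get("type") == "file.validated":  # enrichment's trigger event
            r.xadd("events:enrichment.replay", fields)
            count += 1
    return count
```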

When REST is still right

Not everything should be an event. Queries are still HTTP. When the frontend needs to fetch a fund's current data, that's a GET request. When a user clicks "approve this mapping," that's a POST. Events are for things that happened. REST is for things you want to know right now.

The heuristic is simple: if the caller needs an immediate response, use HTTP. If the caller just needs to announce that something occurred, use an event. Fund data pipelines are almost entirely the latter. The UI is almost entirely the former. Keep them separate.
