TL;DR — The saga pattern replaces heavyweight two‑phase commits with a series of local transactions and compensating actions, letting you keep eventual consistency across order, inventory, and payment services. By wiring these steps through Kafka (or another durable broker) and making each service idempotent, you can ship a commerce‑grade microservice ecosystem that survives network glitches, partial failures, and scaling pressure.
Modern commerce platforms rarely run on a single monolith any more. Instead, they consist of dozens of independently deployable services—catalog, cart, order, inventory, payment, shipping, fraud, and more. Each service owns its own database, and a single business operation (e.g., “place an order”) must touch several of them. Traditional ACID transactions cannot span these boundaries, so engineers either accept eventual inconsistency or resort to complex two‑phase commit (2PC) solutions that hurt latency and availability.
The saga pattern offers a pragmatic middle ground: break the global transaction into a chain of local transactions, each followed by a compensating transaction that undoes its effects if something later fails. The result is a workflow that is both resilient and observable, while still guaranteeing that the system ends up in a consistent state.
Below we walk through the problem space, the theory behind sagas, two concrete implementation styles (orchestration vs. choreography), a full‑stack example built with Kafka and Spring Boot, and production‑grade patterns you need to bake in from day one.
The Challenge of Distributed Transactions in Commerce
E‑commerce order processing is a textbook case of a distributed transaction:
| Step | Service | Primary DB Write | Reason for Failure |
|---|---|---|---|
| 1 | Order Service | Insert order row (status = pending) | Validation error, DB overload |
| 2 | Inventory Service | Reserve stock | Stock out, race condition |
| 3 | Payment Service | Authorize payment | Card declined, gateway timeout |
| 4 | Shipping Service | Create shipment record | Address validation fail |
| 5 | Notification Service | Send confirmation email | SMTP outage |
If any step fails, the whole business operation must be rolled back. In a monolith you would wrap all five statements in a single SQL transaction. In a microservice world you cannot lock resources across network boundaries without severe performance penalties. The naive alternative—just fire‑and‑forget each call—leads to orphaned reservations, double‑charged cards, or customers receiving no confirmation.
Two‑phase commit (2PC) can enforce atomicity, but it requires a transaction manager that coordinates every participant, holds locks until the global decision is made, and forces all services to speak the same protocol. In practice, 2PC is rarely used in high‑throughput storefronts because:
- Latency: The prepare phase adds at least one round‑trip per participant.
- Availability: If the coordinator crashes, participants remain in a prepared state, potentially causing deadlocks.
- Complexity: Every service must expose a prepare and commit/rollback API, which is at odds with the “own your data” principle.
Enter the saga pattern: it replaces the single global lock with a sequence of independent, locally committed transactions plus explicit compensation logic.
Understanding the Saga Pattern
A saga is a workflow composed of ordered steps. Each step:
- Executes a local transaction that is immediately committed.
- Publishes an event (or sends a command) that triggers the next step.
- Registers a compensating action that can reverse its effects if a later step fails.
If every step succeeds, the saga completes and the business transaction is considered committed. If any step reports an error, the saga rewinds: it runs the compensation actions for all previously successful steps in reverse order.
Orchestration vs. Choreography
There are two canonical ways to drive a saga:
| Aspect | Orchestration | Choreography |
|---|---|---|
| Coordinator | A dedicated Saga Orchestrator (e.g., a state machine) decides the next step. | Each service reacts to events produced by the previous service. |
| Visibility | Centralized view of saga state (e.g., in a DB or Redis). | State is distributed across services; debugging requires tracing events. |
| Complexity | Orchestrator logic can become bulky, but services stay simple. | Services must know the global flow, increasing coupling. |
| Typical Tools | Camunda, Temporal, Netflix Conductor, Spring State Machine. | Apache Kafka, NATS, RabbitMQ with event‑driven listeners. |
For a commerce platform that already uses Kafka as its event backbone, choreography often feels more natural: each service publishes a “X‑completed” event and listens for the “X‑completed” event it needs. However, when you need transactional visibility (e.g., to expose a UI that shows “order is being compensated”), an orchestrator can be layered on top of the same events.
Below we build a hybrid approach: a lightweight orchestrator implemented as a Spring Boot service that persists saga state, while the actual work is still performed by independent microservices via Kafka topics. This gives us a single source of truth for monitoring without sacrificing the decoupling benefits of event‑driven choreography.
Architecture in Practice: A Sample Order Service
High‑Level Diagram
+----------------+ +----------------+ +----------------+ +----------------+
| Order Service | ---> | Inventory Svc | ---> | Payment Svc | ---> | Shipping Svc |
+----------------+ +----------------+ +----------------+ +----------------+
^ ^ ^ ^
| | | |
| Kafka Topic: | Kafka Topic: | Kafka Topic: |
| order.created | stock.reserved | payment.authorized |
| order.failed | stock.released | payment.refunded |
| order.completed | stock.confirmed | payment.captured |
+----------------------+-----------------------+-----------------------+
- Order Service receives the HTTP
POST /ordersrequest, creates anOrderrecord with status pending, and publishesorder.created. - Inventory Service consumes
order.created, attempts to reserve the required SKUs, writes a reservation row, and publishes eitherstock.reservedorstock.unavailable. - Payment Service listens for
stock.reserved. If successful, it authorizes the payment and emitspayment.authorizedorpayment.failed. - Shipping Service reacts to
payment.authorized, creates a shipment, and publishesshipping.created.
If any service emits a failure event, the orchestrator initiates compensation:
payment.failed→stock.releasedstock.unavailable→order.cancelledshipping.failed→payment.refunded→stock.released
All events are persisted in Kafka with log compaction enabled, guaranteeing that a newly started consumer can recover the latest state.
Data Ownership and Idempotency
Each service owns its own table:
-- Order Service
CREATE TABLE orders (
id UUID PRIMARY KEY,
customer_id UUID NOT NULL,
status TEXT NOT NULL, -- pending, confirmed, cancelled, failed
total_amount NUMERIC(12,2),
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Inventory Service
CREATE TABLE stock_reservations (
id UUID PRIMARY KEY,
order_id UUID NOT NULL,
sku TEXT NOT NULL,
qty INT NOT NULL,
status TEXT NOT NULL, -- reserved, released
created_at TIMESTAMP,
updated_at TIMESTAMP
);
Idempotency is enforced by deduplicating on the message key (the order_id). When a consumer receives a duplicate stock.reserved event, it looks up the reservation by order_id; if it already exists with status reserved, the handler simply returns success. This pattern eliminates “at‑least‑once” side effects that would otherwise cause double reservations or double refunds.
Implementing Sagas with Kafka and Spring Boot
Below is a minimal, production‑ready skeleton that demonstrates the choreography side (event handling) and the orchestrator side (state machine). The code assumes you are using Spring Boot 3.x, Spring Cloud Stream, and Kafka.
1. Maven Dependencies
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.cloud</groupId>
<artifactId>spring-cloud-starter-stream-kafka</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.statemachine</groupId>
<artifactId>spring-statemachine-core</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
</dependencies>
2. Saga Orchestrator State Machine (Java)
package com.example.saga.orchestrator;
import org.springframework.context.annotation.Configuration;
import org.springframework.statemachine.config.EnableStateMachineFactory;
import org.springframework.statemachine.config.EnumStateMachineConfigurerAdapter;
import org.springframework.statemachine.config.builders.StateMachineStateConfigurer;
import org.springframework.statemachine.config.builders.StateMachineTransitionConfigurer;
@Configuration
@EnableStateMachineFactory
public class SagaStateMachineConfig
extends EnumStateMachineConfigurerAdapter<SagaState, SagaEvent> {
@Override
public void configure(StateMachineStateConfigurer<SagaState, SagaEvent> states) throws Exception {
states
.withStates()
.initial(SagaState.START)
.state(SagaState.INVENTORY_RESERVED)
.state(SagaState.PAYMENT_AUTHORIZED)
.state(SagaState.SHIPPING_CREATED)
.end(SagaState.COMPLETED)
.end(SagaState.COMPENSATED);
}
@Override
public void configure(StateMachineTransitionConfigurer<SagaState, SagaEvent> transitions) throws Exception {
transitions
.withExternal()
.source(SagaState.START).target(SagaState.INVENTORY_RESERVED).event(SagaEvent.INVENTORY_SUCCESS)
.and()
.withExternal()
.source(SagaState.INVENTORY_RESERVED).target(SagaState.PAYMENT_AUTHORIZED).event(SagaEvent.PAYMENT_SUCCESS)
.and()
.withExternal()
.source(SagaState.PAYMENT_AUTHORIZED).target(SagaState.SHIPPING_CREATED).event(SagaEvent.SHIPPING_SUCCESS)
.and()
.withExternal()
.source(SagaState.SHIPPING_CREATED).target(SagaState.COMPLETED).event(SagaEvent.SAGA_SUCCESS)
// Compensation paths
.and()
.withExternal()
.source(SagaState.PAYMENT_AUTHORIZED).target(SagaState.COMPENSATED).event(SagaEvent.PAYMENT_FAILURE)
.and()
.withExternal()
.source(SagaState.INVENTORY_RESERVED).target(SagaState.COMPENSATED).event(SagaEvent.INVENTORY_FAILURE);
}
}
SagaState and SagaEvent are simple enums that map directly to the Kafka topics you’ll emit.
3. Event Producer (Order Service)
package com.example.order;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.messaging.support.MessageBuilder;
import org.springframework.cloud.stream.function.StreamBridge;
import org.springframework.stereotype.Service;
@Service
public class OrderPublisher {
@Autowired
private StreamBridge bridge;
public void publishOrderCreated(OrderDto order) {
bridge.send(
"orderCreated-out-0", // binding name from application.yml
MessageBuilder.withPayload(order)
.setHeader("partitionKey", order.getId())
.build()
);
}
}
Configuration (application.yml):
spring:
cloud:
stream:
bindings:
orderCreated-out-0:
destination: order.created
producer:
partitionKeyExpression: headers['partitionKey']
kafka:
binder:
brokers: ${KAFKA_BOOTSTRAP_SERVERS}
autoCreateTopics: true
4. Compensation Logic (Inventory Service)
package com.example.inventory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
@Component
public class InventoryListener {
@KafkaListener(topics = "order.created", groupId = "inventory")
@Transactional
public void handleOrderCreated(OrderEvent event) {
if (reserveStock(event)) {
publishSuccess(event);
} else {
publishFailure(event);
}
}
private boolean reserveStock(OrderEvent event) {
// Pseudo‑code: check DB, insert reservation row, return true on success
// Must be idempotent: SELECT first, then INSERT if missing
return true;
}
private void publishSuccess(OrderEvent event) {
// send stock.reserved event
}
private void publishFailure(OrderEvent event) {
// send stock.unavailable event
}
@KafkaListener(topics = "payment.failed", groupId = "inventory")
@Transactional
public void compensateStockRelease(PaymentFailedEvent evt) {
// Find reservation by orderId and set status = 'released'
// Idempotent: UPDATE ... WHERE status='reserved'
}
}
Key production notes:
- Transactional boundaries – Each listener runs in a single DB transaction; the Kafka offset is committed only after the transaction succeeds (Spring’s
KafkaTransactionManager). - Exactly‑once semantics – Enable
enable.idempotence=trueon the Kafka producer and configure the broker withtransaction.state.log.replication.factor=3to guarantee that duplicate publishes do not create extra rows. - Dead‑letter handling – For unrecoverable errors (e.g., schema mismatch), forward the message to a
order.dlqtopic for manual inspection.
Patterns in Production: Idempotency, Timeouts, and Monitoring
Idempotent Handlers
Every consumer must be safe to run multiple times. A common technique:
INSERT INTO stock_reservations (id, order_id, sku, qty, status)
SELECT :id, :orderId, :sku, :qty, 'reserved'
WHERE NOT EXISTS (SELECT 1 FROM stock_reservations WHERE order_id = :orderId);
If the row already exists, the INSERT does nothing, and the service can treat the call as a no‑op success.
Timeouts & Cancellation Windows
A saga that stalls indefinitely is as bad as a deadlock. Implement a deadline (e.g., 30 seconds) at the orchestrator level:
@Scheduled(fixedDelay = 5000)
public void checkOpenSagas() {
sagaRepository.findByStateAndCreatedAtBefore(
SagaState.START, Instant.now().minusSeconds(30))
.forEach(this::triggerCompensation);
}
When the timeout fires, the orchestrator emits a saga.timeout event that all services treat as a signal to roll back.
Observability
- Tracing – Propagate a correlation ID (e.g.,
X-Trace-Id) through Kafka headers. Tools like OpenTelemetry can stitch together spans from each microservice. - Metrics – Export counters such as
saga.completed,saga.compensated,saga.failedvia Prometheus. - Dashboard – A simple Grafana panel that shows the ratio of successful vs. compensated orders helps spot upstream bottlenecks (e.g., a flaky payment gateway).
Testing the Whole Flow
- Unit tests for each service’s handler using an in‑memory Kafka (
EmbeddedKafka) and an H2 DB. - Contract tests (Pact) to guarantee the shape of events.
- Chaos engineering – Use Gremlin or Chaos Mesh to kill the inventory pod mid‑saga, then verify that compensation runs and the order ends up in
cancelledstate.
Key Takeaways
- Sagas replace 2PC with a chain of locally committed transactions plus compensating actions, giving you high availability without sacrificing consistency.
- Choose choreography for loose coupling (Kafka, NATS) but add a lightweight orchestrator if you need a global view or UI‑driven status.
- Idempotency is non‑optional; design each consumer to be safe under at‑least‑once delivery.
- Persist saga state (e.g., in a relational DB or Redis) so you can recover from orchestrator crashes and enforce timeouts.
- Instrument everything: correlation IDs, Prometheus metrics, and OpenTelemetry traces are essential for detecting stuck sagas and measuring compensation frequency.
- Test failure paths as thoroughly as happy paths—real‑world commerce systems spend more time handling refunds, stock releases, and order cancellations than processing perfect orders.