TL;DR — The Saga pattern replaces heavyweight two‑phase commit with a series of local transactions and compensating actions, letting e‑commerce platforms stay consistent even when services fail. By combining an event‑driven coordinator (Kafka) with idempotent microservices (Spring Boot), you can achieve ACID‑like guarantees without a single point of failure.
In today’s high‑velocity commerce environments, a single purchase can touch inventory, pricing, payment, fraud detection, and shipping services—all owned by different teams and often deployed independently. Traditional monolithic transactions cannot span these boundaries, and naïve retries quickly lead to duplicate orders or lost inventory. This post walks you through the Saga pattern, shows how to wire it into a real‑world e‑commerce architecture, and provides concrete code snippets, monitoring tips, and failure‑mode handling that production engineers can copy into their own pipelines.
What is the Saga Pattern?
A saga is a sequence of local transactions where each step publishes an event that triggers the next step. If any step fails, a series of compensating transactions roll back the work already done. Unlike a traditional ACID transaction that locks resources across services, a saga:
- Keeps each service’s data locally consistent.
- Relies on asynchronous messaging for coordination.
- Guarantees eventual consistency, not immediate consistency.
The pattern was popularized by the microservices community and is described in depth on microservices.io’s saga page.
Types of Sagas
| Type | Coordination | Typical Use‑Case |
|---|---|---|
| Choreography | Each service reacts to events published by the previous service. | Low‑latency order pipelines where a central coordinator would be a bottleneck. |
| Orchestration | A dedicated saga orchestrator (often a state machine) sends commands to services. | Complex workflows with branching logic, timeouts, or retries that need a single source of truth. |
Both approaches have trade‑offs in latency, observability, and operational complexity. In the commerce scenario we’ll focus on orchestration because it gives us a clear audit trail for financial transactions.
Architecture in an E‑commerce System
Below is a simplified diagram of a typical checkout flow using the saga pattern:
[API Gateway] → [Order Service] → (Publish OrderCreated) → [Inventory Service] → (Publish InventoryReserved) → [Payment Service] → (Publish PaymentCaptured) → [Shipping Service] → (Publish ShipmentCreated)
If the Payment Service declines the charge, the orchestrator triggers compensations:
[Orchestrator] → (Send CancelReservation) → [Inventory Service] → (Publish InventoryReleased) → (Send CancelOrder) → [Order Service] → (Publish OrderCancelled)
Key Architectural Concerns
- Message Broker – We use Apache Kafka for its durability, ordering guarantees per partition, and built‑in consumer groups. See the official Kafka docs for configuration best practices: https://kafka.apache.org/documentation.
- State Store – The orchestrator persists saga state in a relational table (PostgreSQL) to survive restarts. The table records the current step, payload, and timestamps.
- Idempotency – Every service must treat duplicate commands as no‑ops. This is achieved by storing a business‑key (e.g.,
order_id) and checking aprocessed_eventstable before acting. - Compensation Logic – Unlike rollbacks, compensations are forward actions (e.g., releasing inventory). They must be idempotent and observable.
Saga Coordination Strategies
Orchestration with Spring State Machine
Spring State Machine provides a lightweight, declarative way to model saga steps as states and transitions. Below is a minimal configuration for the checkout saga:
# src/main/resources/saga-states.yaml
states:
- name: START
- name: INVENTORY_RESERVED
- name: PAYMENT_CAPTURED
- name: SHIPMENT_CREATED
- name: COMPLETED
- name: COMPENSATING
- name: ROLLED_BACK
transitions:
- source: START
target: INVENTORY_RESERVED
event: RESERVE_INVENTORY
- source: INVENTORY_RESERVED
target: PAYMENT_CAPTURED
event: CAPTURE_PAYMENT
- source: PAYMENT_CAPTURED
target: SHIPMENT_CREATED
event: CREATE_SHIPMENT
- source: SHIPMENT_CREATED
target: COMPLETED
event: FINISH
- source: * # any state
target: COMPENSATING
event: FAIL
- source: COMPENSATING
target: ROLLED_BACK
event: COMPENSATION_DONE
The orchestrator publishes the RESERVE_INVENTORY event to Kafka, and the state machine advances automatically when it receives the corresponding success event.
Choreography with Event Sourcing
In a pure choreography approach, each service publishes its own domain events. The flow is driven by event listeners:
@Service
public class InventoryListener {
@KafkaListener(topics = "order-created")
public void onOrderCreated(OrderCreatedEvent evt) {
if (reserveStock(evt.getOrderId(), evt.getItems())) {
kafkaTemplate.send("inventory-reserved", new InventoryReservedEvent(evt.getOrderId()));
} else {
kafkaTemplate.send("inventory-failed", new InventoryFailedEvent(evt.getOrderId()));
}
}
}
The advantage is zero central coordination, but debugging a failed saga often requires replaying the entire event stream.
Implementation Walkthrough with Kafka and Spring Boot
1. Define Shared Contracts
We store event schemas in a schemas/ folder and use Avro for versioning. Example order-created.avsc:
{
"type": "record",
"name": "OrderCreated",
"namespace": "com.acme.ecommerce.events",
"fields": [
{"name": "orderId", "type": "string"},
{"name": "customerId", "type": "string"},
{"name": "items", "type": {"type": "array", "items": "string"}},
{"name": "totalAmount", "type": "double"}
]
}
Both producer and consumer services generate Java classes via the avro-maven-plugin, ensuring type safety across the saga.
2. Producer: Order Service
@RestController
@RequestMapping("/orders")
@RequiredArgsConstructor
public class OrderController {
private final KafkaTemplate<String, OrderCreated> kafkaTemplate;
@PostMapping
public ResponseEntity<Void> placeOrder(@RequestBody OrderRequest req) {
String orderId = UUID.randomUUID().toString();
OrderCreated event = OrderCreated.newBuilder()
.setOrderId(orderId)
.setCustomerId(req.getCustomerId())
.setItems(req.getItems())
.setTotalAmount(req.getTotal())
.build();
kafkaTemplate.send("order-created", orderId, event);
return ResponseEntity.accepted().header("Location", "/orders/" + orderId).build();
}
}
The endpoint returns 202 Accepted because the request is now part of an asynchronous saga.
3. Consumer: Inventory Service
@Service
public class InventoryConsumer {
private final InventoryRepository repo;
private final KafkaTemplate<String, InventoryReserved> producer;
@KafkaListener(topics = "order-created", groupId = "inventory")
public void handle(OrderCreated event) {
boolean reserved = repo.reserve(event.getOrderId(), event.getItems());
if (reserved) {
InventoryReserved reservedEvt = InventoryReserved.newBuilder()
.setOrderId(event.getOrderId())
.build();
producer.send("inventory-reserved", event.getOrderId(), reservedEvt);
} else {
InventoryFailed failedEvt = InventoryFailed.newBuilder()
.setOrderId(event.getOrderId())
.setReason("Insufficient stock")
.build();
producer.send("inventory-failed", event.getOrderId(), failedEvt);
}
}
}
Notice the idempotency guard inside repo.reserve. It checks a processed_events table before attempting to lock stock.
4. Orchestrator Service (Spring State Machine)
@Service
@RequiredArgsConstructor
public class CheckoutOrchestrator {
private final StateMachineFactory<String, String> factory;
private final SagaStateRepository sagaRepo;
private final KafkaTemplate<String, ?> kafkaTemplate;
@KafkaListener(topics = {"inventory-reserved", "payment-captured", "shipment-created", "inventory-failed", "payment-failed"})
public void onEvent(ConsumerRecord<String, ?> record) {
String orderId = record.key();
SagaState state = sagaRepo.findByOrderId(orderId)
.orElseGet(() -> sagaRepo.save(new SagaState(orderId, "START")));
StateMachine<String, String> sm = factory.getStateMachine(orderId);
sm.start();
// Map Kafka events to state machine events
switch (record.topic()) {
case "inventory-reserved":
sm.sendEvent(MessageBuilder.withPayload("RESERVE_INVENTORY").setHeader("orderId", orderId).build());
break;
case "payment-captured":
sm.sendEvent(MessageBuilder.withPayload("CAPTURE_PAYMENT").setHeader("orderId", orderId).build());
break;
case "shipment-created":
sm.sendEvent(MessageBuilder.withPayload("CREATE_SHIPMENT").setHeader("orderId", orderId).build());
break;
case "inventory-failed":
case "payment-failed":
sm.sendEvent(MessageBuilder.withPayload("FAIL").setHeader("orderId", orderId).build());
break;
}
// Persist the new state
state.setCurrentState(sm.getState().getId());
sagaRepo.save(state);
}
}
The orchestrator listens to all saga‑related topics, updates the state machine, and persists progress. If a failure event arrives, it triggers the compensation branch automatically.
5. Compensation Handlers
Compensation logic lives in the same services that performed the forward action. For example, the Inventory Service adds a listener for cancel-reservation:
@KafkaListener(topics = "cancel-reservation")
public void compensate(CompensationRequest req) {
repo.release(req.getOrderId(), req.getItems());
kafkaTemplate.send("inventory-released", req.getOrderId(), new InventoryReleased());
}
Because the compensating command is idempotent, replaying the same cancel-reservation message (e.g., after a broker restart) does not corrupt data.
Patterns in Production: Compensation and Idempotency
Idempotent Command Handlers
Every incoming command should:
- Check a deduplication store (e.g., a Redis SET of processed
messageIds) before executing business logic. - Persist the outcome atomically with the business change, using the same transaction that updates the domain table.
# Pseudocode for a generic idempotent handler
def handle_command(cmd):
if redis.sismember("processed", cmd.id):
logger.info("Duplicate command %s ignored", cmd.id)
return
with db.transaction():
apply_business_logic(cmd)
redis.sadd("processed", cmd.id)
Timeout & Retry Policies
Kafka’s consumer groups guarantee at‑least‑once delivery. To avoid endless retries on permanent failures:
- Configure
max.poll.interval.msto a sensible value (e.g., 5 minutes) so a stalled saga is detected. - Use dead‑letter topics for messages that exceed
retries(commonly 3 attempts). The orchestrator can then move the saga to the compensation path.
# Example Kafka consumer config snippet
spring.kafka.consumer:
enable-auto-commit: false
max-poll-interval-ms: 300000
max-poll-records: 10
group-id: inventory
Observability
- Tracing – Correlate all saga events with a single
traceIdusing OpenTelemetry. Each service adds the sametraceIdheader to outgoing Kafka messages. - Metrics – Export counters such as
saga.completed,saga.compensated,saga.failedto Prometheus. - Dashboards – Grafana panels that show the average saga duration, failure rate per step, and backlog size per topic.
Testing and Validation
Unit Tests with Embedded Kafka
@EmbeddedKafka(partitions = 1, topics = {"order-created", "inventory-reserved"})
@SpringBootTest
class InventoryConsumerTest {
@Autowired
private KafkaTemplate<String, OrderCreated> template;
@Autowired
private InventoryRepository repo;
@Test
void shouldReserveStockWhenOrderCreated() {
OrderCreated evt = OrderCreated.newBuilder()
.setOrderId("order-123")
.setItems(List.of("sku-1", "sku-2"))
.build();
template.send("order-created", "order-123", evt);
// Allow async processing
Awaitility.await().atMost(5, TimeUnit.SECONDS)
.untilAsserted(() -> assertTrue(repo.isReserved("order-123")));
}
}
End‑to‑End Saga Simulation
Use Testcontainers to spin up Kafka, PostgreSQL, and the full microservice stack. Run a script that creates an order, then deliberately makes the payment service reject the charge, and verify that both inventory and order records are rolled back.
# Run the full stack locally
docker compose up -d
./gradlew integrationTest
Chaos Engineering
Inject latency or network partitions with Gremlin or Chaos Mesh to confirm the orchestrator correctly retries and eventually compensates. Record the Mean Time To Recover (MTTR) for each failure mode and feed it back into SLO calculations.
Key Takeaways
- The Saga pattern replaces distributed two‑phase commit with a chain of local transactions and compensations, delivering eventual consistency with high availability.
- Choose orchestration when you need a single source of truth for complex commerce workflows; use choreography for simple, low‑latency pipelines.
- Kafka provides durable, ordered messaging that is ideal for saga coordination; pair it with a state store (PostgreSQL) to survive process crashes.
- Idempotency, deduplication, and explicit compensation logic are non‑negotiable for production‑grade sagas.
- Observability (tracing, metrics, dead‑letter handling) turns a “fire‑and‑forget” workflow into a debuggable, SLA‑friendly system.