Docs / Use Cases / Step Functions Saga with Fault Injection

Step Functions Saga with Fault Injection

A four-step order saga (reserve → charge → ship → notify) with cascading compensations. Each forward state has an ASL Catch that jumps to the compensation for the previous successful step, which chains down to undo every side-effect. Three failure scenarios are injected and the final state of each DynamoDB table is asserted to be clean.

What the demo does. Provisions 3 DynamoDB tables (one per side-effect: inventory, payments, shipments), 1 SNS topic for notifications, an IAM role for the Lambdas with PutItem/DeleteItem on the tables and Publish on the topic, an IAM role for Step Functions with lambda:InvokeFunction, and 7 Lambda functions (one handler deployed seven times, each with a different STEP_NAME env var that picks the branch: 4 forward + 3 compensations). Starts an execution with no fault and asserts SUCCEEDED plus a row in each of the 3 tables. Then runs three failure scenarios (fail_at=charge, ship, notify): each must reach FailHandler via the appropriate cascading compensation chain, and every DDB table must be empty for that order_id. Total runtime ~30 s. No special LocalEmu flags required; works under IAM_ENFORCEMENT=1. Source: 20-stepfunctions-saga/ in the examples repo.

Topology

              ReserveInventory ──► ChargePayment ──► CreateShipment ──► SendNotification
                    │                  │                  │                   │
                    │ catch            │ catch            │ catch             │ catch
                    ▼                  ▼                  ▼                   ▼
              FailHandler      CompensateReserve  CompensateCharge    CompensateShip
                                       ▲                  │                   │
                                       │                  ▼                   ▼
                                       └────────  CompensateReserve   CompensateCharge
                                                                              │
                                                                              ▼
                                                                      CompensateReserve
                                                                              │
                                                                              ▼
                                                                        FailHandler
🔁

Real execution engine

LocalEmu actually runs the ASL state machine. Each Task invokes a real Lambda; Catch routes on real errors.

🧪

Single-flag fault injection

One fail_at input flag; no per-test mocks. The handler's production shape is preserved across all scenarios.

🧹

Verifiable clean state

After each fault, the test queries DynamoDB directly and asserts every table is empty for the failed order_id. The compensation chain either ran or it did not, and the table state says which.

Step-by-Step Walkthrough

Step 1: Start LocalEmu

Terminal
localemu start

No special flags needed. The Step Functions service is always on; the demo also works correctly under IAM_ENFORCEMENT=1.

Step 2: One handler, seven deployments

src/handler.py
STEP = os.environ["STEP_NAME"]  # one of: reserve, charge, ship, notify, comp_*

def handler(event, _ctx):
    order_id = event["order_id"]
    fail_at = event.get("fail_at")

    # Fault-injection switch, one input flag is enough.
    if fail_at == STEP:
        raise RuntimeError(f"injected failure at '{STEP}'")

    if STEP == "reserve":
        ddb.put_item(TableName=INVENTORY_TABLE,
                     Item={"order_id": {"S": order_id}, ...})
    elif STEP == "charge":
        ddb.put_item(TableName=PAYMENTS_TABLE,  ...)
    elif STEP == "ship":
        ddb.put_item(TableName=SHIPMENTS_TABLE, ...)
    elif STEP == "notify":
        sns.publish(TopicArn=NOTIFICATIONS_TOPIC, ...)
    elif STEP == "comp_ship":
        ddb.delete_item(TableName=SHIPMENTS_TABLE, ...)
    elif STEP == "comp_charge":
        ddb.delete_item(TableName=PAYMENTS_TABLE,  ...)
    elif STEP == "comp_reserve":
        ddb.delete_item(TableName=INVENTORY_TABLE, ...)
    return event

Each Lambda gets a different STEP_NAME env var that picks the branch. The fail_at flag in the input event makes the matching step raise, that's the entire fault-injection surface.

Step 3: The state machine (ASL)

State machine definition
// Order saga ASL, Catch on every forward state jumps to the
// compensation for the previous successful step. Each compensation
// chains down to FailHandler.
{
  "Comment": "Order saga: reserve -> charge -> ship -> notify with cascading compensations",
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "<reserve-arn>",
      "Next": "ChargePayment",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"FailHandler","ResultPath":"$.error"}]
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "<charge-arn>",
      "Next": "CreateShipment",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateReserve","ResultPath":"$.error"}]
    },
    "CreateShipment": {
      "Type": "Task",
      "Resource": "<ship-arn>",
      "Next": "SendNotification",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateCharge","ResultPath":"$.error"}]
    },
    "SendNotification": {
      "Type": "Task",
      "Resource": "<notify-arn>",
      "End": true,
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateShip","ResultPath":"$.error"}]
    },
    "CompensateShip":    { "Type": "Task", "Resource": "<comp-ship-arn>",    "Next": "CompensateCharge"  },
    "CompensateCharge":  { "Type": "Task", "Resource": "<comp-charge-arn>",  "Next": "CompensateReserve" },
    "CompensateReserve": { "Type": "Task", "Resource": "<comp-reserve-arn>", "Next": "FailHandler"        },
    "FailHandler": { "Type": "Fail", "Cause": "saga compensated" }
  }
}

Each forward state's Catch jumps to the compensation for the previous successful step. From there the compensations chain down to FailHandler, undoing each prior write in reverse order.

Step 4: Happy path

./infra/02_happy_path.sh
$ ./infra/02_happy_path.sh
== 5. Happy path: invoke saga, expect SUCCEEDED + all 3 tables populated ==
  Starting execution for order_id=happy-1778842587
  EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:1420fd1e...
 Execution status = SUCCEEDED
  Asserting DDB state
   inventory has happy-1778842587
   payments has happy-1778842587
   shipments has happy-1778842587
 Happy path verified, all forward steps committed their side effects.

Start an execution with no fault. The 4 forward states run, each writes its row, and the execution ends SUCCEEDED. All 3 tables hold the order.

Step 5: Fault injection, 3 scenarios, cascading compensations

./infra/03_fault_inject.sh
$ ./infra/03_fault_inject.sh
== 6. Fault injection: three failure scenarios, all must compensate cleanly ==
  Scenario: fail_at=charge  order_id=fail-charge-1778842596
   execution status = FAILED (FailHandler reached)
   saga-inventory is clean for fail-charge-1778842596

  Scenario: fail_at=ship    order_id=fail-ship-1778842600
   execution status = FAILED (FailHandler reached)
   saga-inventory is clean for fail-ship-1778842600
   saga-payments  is clean for fail-ship-1778842600

  Scenario: fail_at=notify  order_id=fail-notify-1778842603
   execution status = FAILED (FailHandler reached)
   saga-inventory  is clean for fail-notify-1778842603
   saga-payments   is clean for fail-notify-1778842603
   saga-shipments  is clean for fail-notify-1778842603
 All three failure scenarios compensated cleanly.

Each scenario injects a failure at a different point. Every execution ends FAILED (reaching FailHandler is the intended terminal state for a fully-compensated saga), and the DynamoDB assertion proves no residual writes survive.

What the run proves

Scenario Execution status Compensations fired End state
no fail_atSUCCEEDEDnoneall 3 tables hold the order
fail_at=chargeFAILEDCompensateReserveall 3 tables empty
fail_at=shipFAILEDCompensateCharge → CompensateReserveall 3 tables empty
fail_at=notifyFAILEDCompensateShip → CompensateCharge → CompensateReserveall 3 tables empty

Full source: src/handler.py

A single handler deployed seven times, one per saga step. STEP_NAME picks the branch. The fault-injection flag (event['fail_at']) raises RuntimeError when it matches the current STEP_NAME, which the ASL Catch routes to the appropriate compensation chain.

"""
A single handler deployed seven times, each instance is wired to one
step of an order-processing saga by its STEP_NAME env var:

    Forward:        reserve -&gt; charge -&gt; ship -&gt; notify
    Compensations:  comp_ship -&gt; comp_charge -&gt; comp_reserve

The state-machine ASL catches any forward-step failure and jumps to
the compensation for the *previous* successful step, cascading down
to release every side-effect that was applied. Inject a failure with
event["fail_at"] = "&lt;step&gt;".

The handler is intentionally tiny: each step writes one DDB item or
publishes one SNS message. Compensations delete the corresponding
DDB item; the notification step has no compensation because it is
the last forward step.
"""

import json
import os

import boto3
from botocore.exceptions import ClientError


STEP = os.environ["STEP_NAME"]

INVENTORY_TABLE = os.environ.get("INVENTORY_TABLE", "")
PAYMENTS_TABLE = os.environ.get("PAYMENTS_TABLE", "")
SHIPMENTS_TABLE = os.environ.get("SHIPMENTS_TABLE", "")
NOTIFICATIONS_TOPIC = os.environ.get("NOTIFICATIONS_TOPIC", "")


def _ddb():
    return boto3.client("dynamodb")


def _sns():
    return boto3.client("sns")


def handler(event, _ctx):
    order_id = event["order_id"]
    fail_at = event.get("fail_at")

    if fail_at == STEP:
        raise RuntimeError(f"injected failure at '{STEP}'")

    if STEP == "reserve":
        _ddb().put_item(
            TableName=INVENTORY_TABLE,
            Item={"order_id": {"S": order_id}, "qty": {"N": str(event.get("qty", 1))}},
        )
        return {**event, "inventory_reserved": True}

    if STEP == "charge":
        _ddb().put_item(
            TableName=PAYMENTS_TABLE,
            Item={"order_id": {"S": order_id}, "amount": {"N": str(event.get("amount", 100))}},
        )
        return {**event, "payment_charged": True}

    if STEP == "ship":
        _ddb().put_item(
            TableName=SHIPMENTS_TABLE,
            Item={"order_id": {"S": order_id}, "status": {"S": "created"}},
        )
        return {**event, "shipment_created": True}

    if STEP == "notify":
        _sns().publish(
            TopicArn=NOTIFICATIONS_TOPIC,
            Message=json.dumps({"order_id": order_id, "status": "complete"}),
        )
        return {**event, "notification_sent": True}

    if STEP == "comp_ship":
        try:
            _ddb().delete_item(TableName=SHIPMENTS_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "shipment_cancelled": True}

    if STEP == "comp_charge":
        try:
            _ddb().delete_item(TableName=PAYMENTS_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "payment_refunded": True}

    if STEP == "comp_reserve":
        try:
            _ddb().delete_item(TableName=INVENTORY_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "inventory_released": True}

    raise ValueError(f"unknown STEP_NAME: {STEP}")

Full demo output

Captured from a clean run on LocalEmu v0.1.dev133. Each execution prints its full Step Functions ARN so you can also pull it up in the LocalEmu dashboard.

[12:56:15] Clearing any previous run state (best effort)
  ✓ Clean slate

== 1. Create DynamoDB tables + SNS topic ==
[12:56:20] Creating table saga-inventory
[12:56:20] Creating table saga-payments
[12:56:20] Creating table saga-shipments
  ✓ Table saga-inventory ACTIVE
  ✓ Table saga-payments ACTIVE
  ✓ Table saga-shipments ACTIVE
[12:56:21] Creating SNS topic saga-notifications
  ✓ Topic arn:aws:sns:us-east-1:000000000000:saga-notifications

== 2. Create IAM roles ==
[12:56:22] Creating Lambda role saga-lambda-role
  ✓ Lambda role arn:aws:iam::000000000000:role/saga-lambda-role
[12:56:22] Creating Step Functions role saga-sfn-role
  ✓ Step Functions role arn:aws:iam::000000000000:role/saga-sfn-role

== 3. Package + create the seven Lambda functions ==
  ✓ Zip built
[12:56:23] Creating Lambda saga-reserve
  ✓   saga-reserve arn:aws:lambda:us-east-1:000000000000:function:saga-reserve
[12:56:23] Creating Lambda saga-charge
  ✓   saga-charge arn:aws:lambda:us-east-1:000000000000:function:saga-charge
[12:56:24] Creating Lambda saga-ship
  ✓   saga-ship arn:aws:lambda:us-east-1:000000000000:function:saga-ship
[12:56:24] Creating Lambda saga-notify
  ✓   saga-notify arn:aws:lambda:us-east-1:000000000000:function:saga-notify
[12:56:25] Creating Lambda saga-comp_ship
  ✓   saga-comp_ship arn:aws:lambda:us-east-1:000000000000:function:saga-comp_ship
[12:56:25] Creating Lambda saga-comp_charge
  ✓   saga-comp_charge arn:aws:lambda:us-east-1:000000000000:function:saga-comp_charge
[12:56:26] Creating Lambda saga-comp_reserve
  ✓   saga-comp_reserve arn:aws:lambda:us-east-1:000000000000:function:saga-comp_reserve

== 4. Create the state machine ==
  ✓ State machine arn:aws:states:us-east-1:000000000000:stateMachine:saga-order-saga
  ✓ Setup complete. Ids in ./.state/ids.env

== 5. Happy path: invoke saga, expect SUCCEEDED + all 3 tables populated ==
[12:56:27] Starting execution for order_id=happy-1778842587
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:1420fd1e-c292-40ad-b92b-fcf3f8c2e3b1
  ✓ Execution status = SUCCEEDED
[12:56:35] Asserting DDB state
  ✓   inventory has happy-1778842587
  ✓   payments has happy-1778842587
  ✓   shipments has happy-1778842587
  ✓ Happy path verified: all forward steps committed their side effects.

== 6. Fault injection: three failure scenarios, all must compensate cleanly ==
[12:56:36] Scenario: fail_at=charge  order_id=fail-charge-1778842596
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:2d5869ef-e245-4bba-9059-8527677562ba
  ✓   execution status = FAILED (FailHandler reached)
  ✓   saga-inventory is clean for fail-charge-1778842596
[12:56:40] Scenario: fail_at=ship  order_id=fail-ship-1778842600
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:002dc628-59ce-40cc-8093-fd226e147cc5
  ✓   execution status = FAILED (FailHandler reached)
  ✓   saga-inventory is clean for fail-ship-1778842600
  ✓   saga-payments is clean for fail-ship-1778842600
[12:56:43] Scenario: fail_at=notify  order_id=fail-notify-1778842603
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:6ef95e7c-752a-4e75-97e6-95ad28be5bc5
  ✓   execution status = FAILED (FailHandler reached)
  ✓   saga-inventory is clean for fail-notify-1778842603
  ✓   saga-payments is clean for fail-notify-1778842603
  ✓   saga-shipments is clean for fail-notify-1778842603
  ✓ All three failure scenarios compensated cleanly.

→ demo complete. Run scripts/teardown.sh when done.

Files

Repository layout.

20-stepfunctions-saga/
├── README.md
├── scripts/
│   ├── demo.sh
│   └── teardown.sh
├── lib/
│   └── common.sh         (awsx, start_and_wait, ddb_has helpers)
├── src/
│   └── handler.py        (dispatch on STEP_NAME env)
└── infra/
    ├── 01_setup.sh       (3 DDB tables, SNS, 2 IAM roles, 7 Lambdas, state machine)
    ├── 02_happy_path.sh  (invoke, expect SUCCEEDED + all 3 tables populated)
    └── 03_fault_inject.sh (3 scenarios, each must compensate cleanly)