Step Functions Saga with Fault Injection
A four-step order saga (reserve → charge → ship → notify) with
cascading compensations. Each forward state has an ASL Catch
that jumps to the compensation for the previous successful step, which chains down to undo
every side-effect. Three failure scenarios are injected and the final state of each
DynamoDB table is asserted to be clean.
Creates 3 DynamoDB tables and an SNS topic, an IAM role for Lambda with PutItem/DeleteItem on the tables and Publish on the topic, an IAM role for Step Functions with lambda:InvokeFunction, and 7 Lambda functions (one handler deployed seven times, each with a different STEP_NAME env var that picks the branch: 4 forward + 3 compensations).
Starts an execution with no fault and asserts SUCCEEDED plus a row in each of the 3 tables.
Then runs three failure scenarios (fail_at=charge, ship, notify): each must reach FailHandler via the appropriate cascading compensation chain, and every DDB table must be empty for that order_id.
Total runtime ~30 s. No special LocalEmu flags required; works under
IAM_ENFORCEMENT=1.
Source: 20-stepfunctions-saga/ in the examples repo.
Topology
ReserveInventory ──► ChargePayment ──► CreateShipment ──► SendNotification
       │                   │                 │                   │
       │ catch             │ catch           │ catch             │ catch
       ▼                   ▼                 ▼                   ▼
  FailHandler      CompensateReserve CompensateCharge     CompensateShip
                           │                 │                   │
                           ▼                 ▼                   ▼
                      FailHandler    CompensateReserve   CompensateCharge
                                             │                   │
                                             ▼                   ▼
                                        FailHandler      CompensateReserve
                                                                 │
                                                                 ▼
                                                            FailHandler
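The routing in the diagram can be checked without any emulator. The following is a minimal sketch that models the topology as a plain transition table and walks it for a given fail_at value. The names SAGA, STEP_TO_STATE, and trace are illustrative and are not part of the example repo.

```python
# Transition table mirroring the diagram: each state maps to
# (next state on success, state the Catch jumps to when the task raises).
SAGA = {
    "ReserveInventory": ("ChargePayment", "FailHandler"),
    "ChargePayment": ("CreateShipment", "CompensateReserve"),
    "CreateShipment": ("SendNotification", "CompensateCharge"),
    "SendNotification": (None, "CompensateShip"),
    "CompensateShip": ("CompensateCharge", None),
    "CompensateCharge": ("CompensateReserve", None),
    "CompensateReserve": ("FailHandler", None),
}

# fail_at values used by the demo, mapped to the state they break.
STEP_TO_STATE = {
    "reserve": "ReserveInventory",
    "charge": "ChargePayment",
    "ship": "CreateShipment",
    "notify": "SendNotification",
}


def trace(fail_at=None):
    """Return the ordered list of states an execution would visit."""
    failing = STEP_TO_STATE.get(fail_at)
    path, state = [], "ReserveInventory"
    while state is not None:
        path.append(state)
        if state == "FailHandler":
            break  # a Fail state is terminal
        on_success, on_error = SAGA[state]
        state = on_error if state == failing else on_success
    return path


if __name__ == "__main__":
    for scenario in (None, "charge", "ship", "notify"):
        print(scenario, "->", " -> ".join(trace(scenario)))
```

Walking the table for fail_at=notify visits all four forward states and then the full three-step compensation chain before FailHandler, which is the longest path in the diagram.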
Real execution engine
LocalEmu actually runs the ASL state machine. Each Task invokes a real Lambda; Catch routes on real errors.
Single-flag fault injection
One fail_at input flag; no per-test mocks. The handler's production shape is preserved across all scenarios.
Verifiable clean state
After each fault, the test queries DynamoDB directly and asserts every table is empty for the failed order_id. The compensation chain either ran or it did not, and the table state says which.
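The demo performs this check in shell. As a rough Python equivalent, the sketch below asks each table for the failed order_id and reports which tables still hold a row. It assumes a boto3-style DynamoDB client is passed in and that order_id is the partition key, as the handler's delete_item calls suggest; residual_tables is a hypothetical helper name.

```python
def residual_tables(ddb, tables, order_id):
    """Return the tables that still hold a row for order_id.

    `ddb` is expected to behave like boto3.client("dynamodb").
    An empty result means the compensation chain left no side effects.
    """
    dirty = []
    for table in tables:
        resp = ddb.get_item(
            TableName=table,
            Key={"order_id": {"S": order_id}},
            ConsistentRead=True,  # avoid reading a stale replica
        )
        if "Item" in resp:
            dirty.append(table)
    return dirty
```

A test would then assert that residual_tables(ddb, ["saga-inventory", "saga-payments", "saga-shipments"], order_id) is an empty list after each failed execution.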
Step-by-Step Walkthrough
Step 1: Start LocalEmu
localemu start
No special flags needed. The Step Functions service is always on; the demo also works correctly under IAM_ENFORCEMENT=1.
Step 2: One handler, seven deployments
# Excerpt from src/handler.py; arguments shown as ... are elided.
STEP = os.environ["STEP_NAME"]  # one of: reserve, charge, ship, notify, comp_*

def handler(event, _ctx):
    order_id = event["order_id"]
    fail_at = event.get("fail_at")

    # Fault-injection switch: one input flag is enough.
    if fail_at == STEP:
        raise RuntimeError(f"injected failure at '{STEP}'")

    if STEP == "reserve":
        ddb.put_item(TableName=INVENTORY_TABLE,
                     Item={"order_id": {"S": order_id}, ...})
    elif STEP == "charge":
        ddb.put_item(TableName=PAYMENTS_TABLE, ...)
    elif STEP == "ship":
        ddb.put_item(TableName=SHIPMENTS_TABLE, ...)
    elif STEP == "notify":
        sns.publish(TopicArn=NOTIFICATIONS_TOPIC, ...)
    elif STEP == "comp_ship":
        ddb.delete_item(TableName=SHIPMENTS_TABLE, ...)
    elif STEP == "comp_charge":
        ddb.delete_item(TableName=PAYMENTS_TABLE, ...)
    elif STEP == "comp_reserve":
        ddb.delete_item(TableName=INVENTORY_TABLE, ...)
    return event

Each Lambda gets a different STEP_NAME env var that picks the branch. The fail_at flag in the input event makes the matching step raise; that is the entire fault-injection surface.
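To make the "one handler, seven deployments" idea concrete, here is a sketch that builds the seven sets of create_function arguments from a single list of step names. The function names (saga-reserve, saga-comp_ship, and so on) and resource names follow the demo output; the Runtime and Handler values are assumptions, and function_configs is a hypothetical helper, not the repo's setup script.

```python
STEPS = ["reserve", "charge", "ship", "notify",
         "comp_ship", "comp_charge", "comp_reserve"]

# Environment shared by all seven functions (names taken from the demo output).
SHARED_ENV = {
    "INVENTORY_TABLE": "saga-inventory",
    "PAYMENTS_TABLE": "saga-payments",
    "SHIPMENTS_TABLE": "saga-shipments",
    "NOTIFICATIONS_TOPIC": "arn:aws:sns:us-east-1:000000000000:saga-notifications",
}


def function_configs(role_arn, zip_bytes):
    """Build one set of Lambda create_function kwargs per saga step."""
    return [
        {
            "FunctionName": f"saga-{step}",
            "Runtime": "python3.12",       # assumption: any supported Python runtime
            "Handler": "handler.handler",  # assumption: handler.py at the zip root
            "Role": role_arn,
            "Code": {"ZipFile": zip_bytes},
            # The only per-function difference is STEP_NAME.
            "Environment": {"Variables": {**SHARED_ENV, "STEP_NAME": step}},
        }
        for step in STEPS
    ]
```

With a boto3 Lambda client, each entry could be passed as lam.create_function(**cfg). The same zip is uploaded seven times, so a change to the handler reaches every step at once.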
Step 3: The state machine (ASL)
// Order saga ASL, Catch on every forward state jumps to the
// compensation for the previous successful step. Each compensation
// chains down to FailHandler.
{
  "Comment": "Order saga: reserve -> charge -> ship -> notify with cascading compensations",
  "StartAt": "ReserveInventory",
  "States": {
    "ReserveInventory": {
      "Type": "Task",
      "Resource": "<reserve-arn>",
      "Next": "ChargePayment",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"FailHandler","ResultPath":"$.error"}]
    },
    "ChargePayment": {
      "Type": "Task",
      "Resource": "<charge-arn>",
      "Next": "CreateShipment",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateReserve","ResultPath":"$.error"}]
    },
    "CreateShipment": {
      "Type": "Task",
      "Resource": "<ship-arn>",
      "Next": "SendNotification",
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateCharge","ResultPath":"$.error"}]
    },
    "SendNotification": {
      "Type": "Task",
      "Resource": "<notify-arn>",
      "End": true,
      "Catch": [{"ErrorEquals":["States.ALL"],"Next":"CompensateShip","ResultPath":"$.error"}]
    },
    "CompensateShip":    { "Type": "Task", "Resource": "<comp-ship-arn>",    "Next": "CompensateCharge" },
    "CompensateCharge":  { "Type": "Task", "Resource": "<comp-charge-arn>",  "Next": "CompensateReserve" },
    "CompensateReserve": { "Type": "Task", "Resource": "<comp-reserve-arn>", "Next": "FailHandler" },
    "FailHandler":       { "Type": "Fail", "Cause": "saga compensated" }
  }
}

Each forward state's Catch jumps to the compensation for the previous successful step. From there the compensations chain down to FailHandler, undoing each prior write in reverse order.
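The "catch in step i routes to the compensation for step i-1" rule is regular enough to generate. The sketch below builds an equivalent definition from two ordered lists, which may help when adding a fifth step. build_saga and its arns argument (a mapping from state name to Lambda ARN) are illustrative assumptions; the demo itself ships the definition as static JSON.

```python
FORWARD = ["ReserveInventory", "ChargePayment", "CreateShipment", "SendNotification"]
# COMPENSATE[i] undoes FORWARD[i]; the last forward step has no compensation.
COMPENSATE = ["CompensateReserve", "CompensateCharge", "CompensateShip"]


def build_saga(arns):
    """Build the saga definition as a dict; arns maps state name -> Lambda ARN."""
    states = {}
    for i, name in enumerate(FORWARD):
        # A failure in step i is routed to the compensation for step i-1.
        # The first step has nothing to undo, so it fails directly.
        on_error = COMPENSATE[i - 1] if i > 0 else "FailHandler"
        state = {
            "Type": "Task",
            "Resource": arns[name],
            "Catch": [{"ErrorEquals": ["States.ALL"],
                       "Next": on_error,
                       "ResultPath": "$.error"}],
        }
        if i + 1 < len(FORWARD):
            state["Next"] = FORWARD[i + 1]
        else:
            state["End"] = True
        states[name] = state
    for i, name in enumerate(COMPENSATE):
        # Each compensation chains to the one before it, ending at FailHandler.
        nxt = COMPENSATE[i - 1] if i > 0 else "FailHandler"
        states[name] = {"Type": "Task", "Resource": arns[name], "Next": nxt}
    states["FailHandler"] = {"Type": "Fail", "Cause": "saga compensated"}
    return {"StartAt": FORWARD[0], "States": states}
```

Serializing the result with json.dumps gives a string suitable for the definition argument of a boto3 create_state_machine call.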
Step 4: Happy path
$ ./infra/02_happy_path.sh
== 5. Happy path: invoke saga, expect SUCCEEDED + all 3 tables populated ==
Starting execution for order_id=happy-1778842587
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:1420fd1e...
✓ Execution status = SUCCEEDED
Asserting DDB state
✓ inventory has happy-1778842587
✓ payments has happy-1778842587
✓ shipments has happy-1778842587
✓ Happy path verified, all forward steps committed their side effects.

Start an execution with no fault. The 4 forward states run: the first three each write a row, the fourth publishes to the SNS topic, and the execution ends SUCCEEDED. All 3 tables hold the order.
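The script drives this from the shell. For readers who prefer Python, the following sketch starts one execution and polls until it reaches a terminal status. It assumes a boto3-style Step Functions client is passed in; run_saga, timeout_s, and poll_s are hypothetical names chosen for this example.

```python
import json
import time

# Statuses after which an execution no longer changes.
TERMINAL = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def run_saga(sfn, state_machine_arn, order_id, fail_at=None,
             timeout_s=60.0, poll_s=0.5):
    """Start one saga execution and block until it reaches a terminal status."""
    payload = {"order_id": order_id}
    if fail_at is not None:
        payload["fail_at"] = fail_at  # omitted entirely on the happy path
    arn = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload),
    )["executionArn"]
    deadline = time.monotonic() + timeout_s
    while True:
        status = sfn.describe_execution(executionArn=arn)["status"]
        if status in TERMINAL:
            return arn, status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{arn} still {status} after {timeout_s}s")
        time.sleep(poll_s)
```

On the happy path the returned status should be SUCCEEDED; passing fail_at="charge", "ship", or "notify" should yield FAILED.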
Step 5: Fault injection, 3 scenarios, cascading compensations
$ ./infra/03_fault_inject.sh
== 6. Fault injection: three failure scenarios, all must compensate cleanly ==
Scenario: fail_at=charge order_id=fail-charge-1778842596
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-charge-1778842596
Scenario: fail_at=ship order_id=fail-ship-1778842600
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-ship-1778842600
✓ saga-payments is clean for fail-ship-1778842600
Scenario: fail_at=notify order_id=fail-notify-1778842603
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-notify-1778842603
✓ saga-payments is clean for fail-notify-1778842603
✓ saga-shipments is clean for fail-notify-1778842603
✓ All three failure scenarios compensated cleanly.

Each scenario injects a failure at a different point. Every execution ends FAILED (reaching FailHandler is the intended terminal state for a fully-compensated saga), and the DynamoDB assertion proves no residual writes survive.
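The table check shows the end state but not the path taken. One way to also confirm which compensations fired is to read the execution history. The sketch below extracts state names from events shaped like those returned by the AWS GetExecutionHistory API; whether LocalEmu implements that call is an assumption here, and both function names are hypothetical.

```python
def states_entered(events):
    """List state names in entry order from execution-history events.

    Only "...StateEntered" events carry stateEnteredEventDetails,
    so every other event type is skipped.
    """
    return [
        event["stateEnteredEventDetails"]["name"]
        for event in events
        if "stateEnteredEventDetails" in event
    ]


def compensations_fired(events):
    """Keep only the compensation states, in the order they ran."""
    return [name for name in states_entered(events)
            if name.startswith("Compensate")]
```

With a boto3 client, events would come from sfn.get_execution_history(executionArn=arn)["events"], following nextToken if the history is paginated. For fail_at=ship the expected result is ["CompensateCharge", "CompensateReserve"].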
What the run proves
| Scenario | Execution status | Compensations fired | End state |
|---|---|---|---|
| no fail_at | SUCCEEDED | none | all 3 tables hold the order |
| fail_at=charge | FAILED | CompensateReserve | all 3 tables empty |
| fail_at=ship | FAILED | CompensateCharge → CompensateReserve | all 3 tables empty |
| fail_at=notify | FAILED | CompensateShip → CompensateCharge → CompensateReserve | all 3 tables empty |
Full source: src/handler.py
A single handler deployed seven times, one per saga step. STEP_NAME picks the branch. The fault-injection flag (event['fail_at']) raises RuntimeError when it matches the current STEP_NAME, which the ASL Catch routes to the appropriate compensation chain.
"""
A single handler deployed seven times, each instance is wired to one
step of an order-processing saga by its STEP_NAME env var:

    Forward:        reserve -> charge -> ship -> notify
    Compensations:  comp_ship -> comp_charge -> comp_reserve

The state-machine ASL catches any forward-step failure and jumps to
the compensation for the *previous* successful step, cascading down
to release every side-effect that was applied. Inject a failure with
event["fail_at"] = "<step>".

The handler is intentionally tiny: each step writes one DDB item or
publishes one SNS message. Compensations delete the corresponding
DDB item; the notification step has no compensation because it is
the last forward step.
"""
import json
import os

import boto3
from botocore.exceptions import ClientError

STEP = os.environ["STEP_NAME"]
INVENTORY_TABLE = os.environ.get("INVENTORY_TABLE", "")
PAYMENTS_TABLE = os.environ.get("PAYMENTS_TABLE", "")
SHIPMENTS_TABLE = os.environ.get("SHIPMENTS_TABLE", "")
NOTIFICATIONS_TOPIC = os.environ.get("NOTIFICATIONS_TOPIC", "")


def _ddb():
    return boto3.client("dynamodb")


def _sns():
    return boto3.client("sns")


def handler(event, _ctx):
    order_id = event["order_id"]
    fail_at = event.get("fail_at")

    if fail_at == STEP:
        raise RuntimeError(f"injected failure at '{STEP}'")

    if STEP == "reserve":
        _ddb().put_item(
            TableName=INVENTORY_TABLE,
            Item={"order_id": {"S": order_id}, "qty": {"N": str(event.get("qty", 1))}},
        )
        return {**event, "inventory_reserved": True}

    if STEP == "charge":
        _ddb().put_item(
            TableName=PAYMENTS_TABLE,
            Item={"order_id": {"S": order_id}, "amount": {"N": str(event.get("amount", 100))}},
        )
        return {**event, "payment_charged": True}

    if STEP == "ship":
        _ddb().put_item(
            TableName=SHIPMENTS_TABLE,
            Item={"order_id": {"S": order_id}, "status": {"S": "created"}},
        )
        return {**event, "shipment_created": True}

    if STEP == "notify":
        _sns().publish(
            TopicArn=NOTIFICATIONS_TOPIC,
            Message=json.dumps({"order_id": order_id, "status": "complete"}),
        )
        return {**event, "notification_sent": True}

    if STEP == "comp_ship":
        try:
            _ddb().delete_item(TableName=SHIPMENTS_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "shipment_cancelled": True}

    if STEP == "comp_charge":
        try:
            _ddb().delete_item(TableName=PAYMENTS_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "payment_refunded": True}

    if STEP == "comp_reserve":
        try:
            _ddb().delete_item(TableName=INVENTORY_TABLE, Key={"order_id": {"S": order_id}})
        except ClientError:
            pass
        return {**event, "inventory_released": True}

    raise ValueError(f"unknown STEP_NAME: {STEP}")

Full demo output
Captured from a clean run on LocalEmu v0.1.dev133. Each execution prints its full Step Functions ARN so you can also pull it up in the LocalEmu dashboard.
[12:56:15] Clearing any previous run state (best effort)
✓ Clean slate
== 1. Create DynamoDB tables + SNS topic ==
[12:56:20] Creating table saga-inventory
[12:56:20] Creating table saga-payments
[12:56:20] Creating table saga-shipments
✓ Table saga-inventory ACTIVE
✓ Table saga-payments ACTIVE
✓ Table saga-shipments ACTIVE
[12:56:21] Creating SNS topic saga-notifications
✓ Topic arn:aws:sns:us-east-1:000000000000:saga-notifications
== 2. Create IAM roles ==
[12:56:22] Creating Lambda role saga-lambda-role
✓ Lambda role arn:aws:iam::000000000000:role/saga-lambda-role
[12:56:22] Creating Step Functions role saga-sfn-role
✓ Step Functions role arn:aws:iam::000000000000:role/saga-sfn-role
== 3. Package + create the seven Lambda functions ==
✓ Zip built
[12:56:23] Creating Lambda saga-reserve
✓ saga-reserve arn:aws:lambda:us-east-1:000000000000:function:saga-reserve
[12:56:23] Creating Lambda saga-charge
✓ saga-charge arn:aws:lambda:us-east-1:000000000000:function:saga-charge
[12:56:24] Creating Lambda saga-ship
✓ saga-ship arn:aws:lambda:us-east-1:000000000000:function:saga-ship
[12:56:24] Creating Lambda saga-notify
✓ saga-notify arn:aws:lambda:us-east-1:000000000000:function:saga-notify
[12:56:25] Creating Lambda saga-comp_ship
✓ saga-comp_ship arn:aws:lambda:us-east-1:000000000000:function:saga-comp_ship
[12:56:25] Creating Lambda saga-comp_charge
✓ saga-comp_charge arn:aws:lambda:us-east-1:000000000000:function:saga-comp_charge
[12:56:26] Creating Lambda saga-comp_reserve
✓ saga-comp_reserve arn:aws:lambda:us-east-1:000000000000:function:saga-comp_reserve
== 4. Create the state machine ==
✓ State machine arn:aws:states:us-east-1:000000000000:stateMachine:saga-order-saga
✓ Setup complete. Ids in ./.state/ids.env
== 5. Happy path: invoke saga, expect SUCCEEDED + all 3 tables populated ==
[12:56:27] Starting execution for order_id=happy-1778842587
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:1420fd1e-c292-40ad-b92b-fcf3f8c2e3b1
✓ Execution status = SUCCEEDED
[12:56:35] Asserting DDB state
✓ inventory has happy-1778842587
✓ payments has happy-1778842587
✓ shipments has happy-1778842587
✓ Happy path verified: all forward steps committed their side effects.
== 6. Fault injection: three failure scenarios, all must compensate cleanly ==
[12:56:36] Scenario: fail_at=charge order_id=fail-charge-1778842596
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:2d5869ef-e245-4bba-9059-8527677562ba
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-charge-1778842596
[12:56:40] Scenario: fail_at=ship order_id=fail-ship-1778842600
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:002dc628-59ce-40cc-8093-fd226e147cc5
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-ship-1778842600
✓ saga-payments is clean for fail-ship-1778842600
[12:56:43] Scenario: fail_at=notify order_id=fail-notify-1778842603
EXEC=arn:aws:states:us-east-1:000000000000:execution:saga-order-saga:6ef95e7c-752a-4e75-97e6-95ad28be5bc5
✓ execution status = FAILED (FailHandler reached)
✓ saga-inventory is clean for fail-notify-1778842603
✓ saga-payments is clean for fail-notify-1778842603
✓ saga-shipments is clean for fail-notify-1778842603
✓ All three failure scenarios compensated cleanly.
→ demo complete. Run scripts/teardown.sh when done.

Files
Repository layout.
20-stepfunctions-saga/
├── README.md
├── scripts/
│   ├── demo.sh
│   └── teardown.sh
├── lib/
│   └── common.sh            (awsx, start_and_wait, ddb_has helpers)
├── src/
│   └── handler.py           (dispatch on STEP_NAME env)
└── infra/
    ├── 01_setup.sh          (3 DDB tables, SNS, 2 IAM roles, 7 Lambdas, state machine)
    ├── 02_happy_path.sh     (invoke, expect SUCCEEDED + all 3 tables populated)
    └── 03_fault_inject.sh   (3 scenarios, each must compensate cleanly)