Chaos Resilience Drill
Everyone claims their retry logic works. Almost nobody can show it. With LocalEmu's throttling and latency simulation, you can prove a retry / backoff / latency-budget envelope holds, in the inner loop, no production incident required. Same Lambda, same load, three different LocalEmu chaos configs, three different stories. All three finish with zero data loss.
The demo creates a DynamoDB table, an IAM role with dynamodb:PutItem scoped to that one table, and a Lambda whose handler does one idempotent PutItem with bounded exponential-backoff retry (delays of 100, 200, 400, and 800 ms between attempts, each with 0 to 50% additive jitter, MAX_ATTEMPTS=5) on every AWS throttle code. It then fires 50 Lambda invocations, 10-way concurrent, via xargs -P, captures each response, and rolls up ok, failed, throttle events, sum(attempts), and latency p50/p95/p99. Finally it scans the table to assert COUNT == 50 (zero data loss). Run the same demo under three LocalEmu chaos configs (calm-mode; SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3; SIMULATE_LATENCY=300) to see the envelope move.
Source: 21-chaos-resilience/ in the examples repo.
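The roll-up step can be sketched in a few lines of Python. This is an illustration, not the repo's driver (which is a shell script, infra/02_load_test.sh): the names `percentile` and `summarize`, the nearest-rank percentile method, and the choice to read latency from the handler's `elapsed_ms` field are all assumptions. The envelope keys (`ok`, `attempts`, `throttled`, `elapsed_ms`) are the ones the handler returns.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(envelopes):
    """Roll per-invocation envelopes up into the counters the summary reports."""
    latencies = [e["elapsed_ms"] for e in envelopes]
    ok = sum(1 for e in envelopes if e["ok"])
    return {
        "requests": len(envelopes),
        "ok": ok,
        "failed": len(envelopes) - ok,
        "throttle_events": sum(e["throttled"] for e in envelopes),
        "sum_attempts": sum(e["attempts"] for e in envelopes),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```

If sum_attempts equals requests, no invocation needed a retry; any excess is the number of retries the envelope absorbed.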
Real AWS error codes
DynamoDB returns ProvisionedThroughputExceededException; S3 returns SlowDown. Same error codes, same HTTP status codes, same retry surface as production.
Observable retries
boto3's auto-retry layer is disabled. The handler's own retry loop is the only one, so reported attempts/throttles are the actual counts.
Zero-loss assertion
After each run, the driver scans DynamoDB and asserts COUNT == REQUESTS. The retry envelope either held or it did not, and the database row count says which.
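A minimal sketch of the zero-loss check, assuming `ddb_client` is a boto3 DynamoDB client (for example `boto3.client("dynamodb")`) passed in by the caller. The function names are hypothetical and the repo's own check lives in a shell script. The loop follows `LastEvaluatedKey` because a single Scan call returns at most 1 MB of data, so one response is not guaranteed to cover the table.

```python
def count_items(ddb_client, table_name):
    """Count items in a table, following Scan pagination until it is exhausted."""
    total = 0
    kwargs = {"TableName": table_name, "Select": "COUNT"}
    while True:
        page = ddb_client.scan(**kwargs)
        total += page["Count"]
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return total
        # More pages remain: resume the scan where this page stopped.
        kwargs["ExclusiveStartKey"] = last_key


def assert_zero_loss(ddb_client, table_name, expected):
    """Raise if the number of stored items differs from the number of requests sent."""
    landed = count_items(ddb_client, table_name)
    if landed != expected:
        raise AssertionError(f"data loss: {landed}/{expected} requests landed")
    return landed
```

The count only equals the request count if retried writes overwrite rather than duplicate, which holds when the table is keyed on the request id (an assumption here, suggested by the item the handler writes).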
Step-by-Step Walkthrough
Step 1: The resilient writer
# src/handler.py (excerpt): explicit, observable retry loop.
# boto3 auto-retry disabled (Config(retries={"max_attempts": 0}))
# so the counts the driver reports are the actual ones.
_THROTTLE_CODES = {
    "ProvisionedThroughputExceededException", "ThrottlingException",
    "Throttling", "RequestLimitExceeded", "TooManyRequestsException",
    "SlowDown", "OverLimit", "LimitExceededException",
}

while attempts < MAX_ATTEMPTS:
    attempts += 1
    try:
        _ddb.put_item(TableName=TABLE, Item=...)
        return {"ok": True, "attempts": attempts, "throttled": throttled, ...}
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code not in _THROTTLE_CODES:
            return {"ok": False, "error": code, ...}  # surface non-throttle errors
        throttled += 1
        if attempts < MAX_ATTEMPTS:
            time.sleep(0.1 * (2 ** (attempts - 1)) * (1 + random.random() * 0.5))

The backoff doubles from 100 ms (100, 200, 400, 800 ms) with 0 to 50% additive jitter. With MAX_ATTEMPTS=5 there are at most four sleeps, because no sleep follows the final attempt. Non-throttle errors are surfaced immediately: the demo measures resilience to throttling specifically, not blanket retry.
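To make the backoff envelope concrete, the sketch below tabulates the minimum and maximum sleep before each retry and the worst-case total. The function name and its parameters are hypothetical; the formula mirrors the `time.sleep` expression in the excerpt. The maximum is an upper bound that is never quite reached, since `random.random()` returns values below 1.

```python
def backoff_bounds_ms(max_attempts=5, base_ms=100, jitter=0.5):
    """(min, max) sleep in ms before each retry; no sleep follows the final attempt."""
    bounds = []
    for retry in range(max_attempts - 1):
        base = base_ms * 2 ** retry
        bounds.append((base, base * (1 + jitter)))
    return bounds


bounds = backoff_bounds_ms()
worst_case_total_ms = sum(upper for _, upper in bounds)
print(bounds)               # [(100, 150.0), (200, 300.0), (400, 600.0), (800, 1200.0)]
print(worst_case_total_ms)  # 2250.0
```

A request that is throttled on every attempt therefore spends at most about 2.25 s sleeping before it gives up.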
Step 2: Calm-mode baseline
$ localemu start # calm-mode baseline
$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
LocalEmu chaos config:
no chaos flags detected (LocalEmu is in calm-mode)
Firing 50 invocations…
✓ All invocations returned in 7s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 0
sum(attempts): 50 (= 50 means zero retries needed)
latency p50: 21 ms
latency p95: 75 ms
latency p99: 103 ms
== 6. Zero-data-loss assertion ==
✓ Zero data loss: 50/50 requests landed in DDB

All 50 requests land with a p95 of 75 ms and zero retries. This is your control run; everything below is measured against this baseline.
Step 3: 30% DynamoDB throttle
$ SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3 \
    localemu start # inject 30% throttling on DDB
$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
LocalEmu chaos config:
DYNAMODB_THROTTLE_RATE=0.3
SIMULATE_THROTTLING=1
Firing 50 invocations…
✓ All invocations returned in 7s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 26
sum(attempts): 76 (= 50 means zero retries needed)
latency p50: 14 ms
latency p95: 513 ms
latency p99: 980 ms
== 6. Zero-data-loss assertion ==
✓ Zero data loss: 50/50 requests landed in DDB

LocalEmu rejects about 30% of PutItem calls with the real AWS error code. The retry loop fires 26 extra attempts to land the 50 items (sum(attempts) = 76). p95 latency jumps from 75 ms to 513 ms because of the backoff sleeps. Every item still lands: zero data loss.
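As a sanity check on those numbers, the expected retry volume can be estimated with a short calculation. The sketch assumes each PutItem call is throttled independently with probability 0.3; that is a modeling assumption about how LocalEmu injects throttles, not something the demo output confirms. The function names are hypothetical.

```python
def expected_attempts(p_throttle, max_attempts=5):
    """Mean attempts per request when each call is throttled independently."""
    # Attempt k+1 only happens if the first k calls were all throttled.
    return sum(p_throttle ** k for k in range(max_attempts))


def p_exhausted(p_throttle, max_attempts=5):
    """Probability that every attempt is throttled and the request fails."""
    return p_throttle ** max_attempts


requests = 50
print(round(requests * expected_attempts(0.3), 1))  # 71.3
print(round(requests * p_exhausted(0.3), 2))        # 0.12
```

The observed sum(attempts) of 76 sits close to the roughly 71 the model predicts for a 50-request sample. Under the same assumption, the chance that at least one of the 50 requests exhausts all five attempts is 1 - (1 - 0.3^5)^50, roughly 11%, so a zero-loss run at this throttle rate is likely but not guaranteed. That is why the row-count assertion is worth running every time.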
Step 4: 300 ms fixed latency
$ SIMULATE_LATENCY=300 localemu start # add 300 ms to every API response
$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
LocalEmu chaos config:
SIMULATE_LATENCY=300
Firing 50 invocations…
✓ All invocations returned in 12s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 0
sum(attempts): 50 (= 50 means zero retries needed)
latency p50: 318 ms
latency p95: 348 ms
latency p99: 397 ms
== 6. Zero-data-loss assertion ==
✓ Zero data loss: 50/50 requests landed in DDB

No throttles and no retries, but every request pays the latency. p50, p95, and p99 all sit just above 300 ms (318 to 397 ms), a tight spread. This is the right scenario to use when sizing Lambda timeouts and API Gateway integration timeouts.
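For timeout sizing, a rough worst-case budget combines the per-call latency with the backoff sleeps. This is a sketch under stated assumptions: every attempt is throttled, each call takes at most `call_latency_ms`, and the backoff follows the handler's schedule. The function name is hypothetical, and the demo does not run latency and throttling together, so the result is an estimate rather than a measured figure.

```python
def worst_case_handler_ms(call_latency_ms, max_attempts=5, base_ms=100, jitter=0.5):
    """Upper bound on handler time if every attempt is throttled.

    Each attempt pays the API latency; each retry is preceded by one backoff sleep.
    """
    calls = max_attempts * call_latency_ms
    sleeps = sum(base_ms * 2 ** i * (1 + jitter) for i in range(max_attempts - 1))
    return calls + sleeps


# 400 ms rounds up the 397 ms p99 observed in the latency run.
print(worst_case_handler_ms(400))  # 4250.0
```

A Lambda timeout comfortably above that figure leaves room for the handler to exhaust its retries and return ok=false itself, instead of being killed mid-backoff.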
What the run proves
| LocalEmu config | ok | throttle events | retries | p50 | p95 | p99 | data loss |
|---|---|---|---|---|---|---|---|
| calm-mode | 50/50 | 0 | 0 | 21 ms | 75 ms | 103 ms | 0 |
| DYNAMODB_THROTTLE_RATE=0.3 | 50/50 | 26 | 26 | 14 ms | 513 ms | 980 ms | 0 |
| SIMULATE_LATENCY=300 | 50/50 | 0 | 0 | 318 ms | 348 ms | 397 ms | 0 |
Full source: src/handler.py
The resilient writer. One PutItem per invocation with bounded exponential backoff on every AWS throttle code. boto3's auto-retry is explicitly disabled (Config(retries={'max_attempts': 0})) so the reported attempts and throttle counts are the actual ones.
"""
Resilient writer Lambda. Idempotent put_item with exponential backoff
retry on throttle (ProvisionedThroughputExceededException), bounded
attempts, jitter. Returns a structured envelope the load driver can
roll up into success / retry / failure counts.

The handler is explicit about what it survived: it does NOT swallow
throttles silently. Every throttle that triggers a retry is counted;
a request that exhausts its retries returns ok=false so the driver
can report it as a real failure.
"""
import os
import random
import time

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

TABLE = os.environ["TABLE_NAME"]
MAX_ATTEMPTS = int(os.environ.get("MAX_ATTEMPTS", "5"))

# Disable boto3's own retry layer so OUR retry loop is the only one;
# otherwise we'd double-count and the timing data would be misleading.
_ddb = boto3.client("dynamodb", config=Config(retries={"max_attempts": 0}))

_THROTTLE_CODES = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "Throttling",
    "RequestLimitExceeded",
    "TooManyRequestsException",
    "SlowDown",
    "OverLimit",
    "LimitExceededException",
}


def _sleep_with_jitter(attempt: int) -> float:
    """Exponential backoff 100ms -> 200ms -> 400ms -> 800ms -> 1600ms,
    plus 0-50% additive jitter. Returns the actual seconds slept."""
    base = 0.1 * (2 ** attempt)
    jitter = base * random.random() * 0.5
    delay = base + jitter
    time.sleep(delay)
    return delay


def handler(event, _ctx):
    req_id = event["request_id"]
    payload = event.get("payload", "default")
    attempts = 0
    throttled = 0
    total_backoff_s = 0.0
    start = time.monotonic()
    last_error = None

    while attempts < MAX_ATTEMPTS:
        attempts += 1
        try:
            _ddb.put_item(
                TableName=TABLE,
                Item={
                    "request_id": {"S": req_id},
                    "payload": {"S": payload},
                    "attempts_to_land": {"N": str(attempts)},
                },
            )
            return {
                "ok": True,
                "request_id": req_id,
                "attempts": attempts,
                "throttled": throttled,
                "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
                "total_backoff_ms": round(total_backoff_s * 1000, 1),
            }
        except ClientError as e:
            code = e.response.get("Error", {}).get("Code", "")
            last_error = code
            if code not in _THROTTLE_CODES:
                # Non-throttle error: surface it, don't retry blindly.
                return {
                    "ok": False,
                    "request_id": req_id,
                    "attempts": attempts,
                    "throttled": throttled,
                    "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
                    "error": code,
                }
            throttled += 1
            if attempts >= MAX_ATTEMPTS:
                break
            total_backoff_s += _sleep_with_jitter(attempts - 1)

    return {
        "ok": False,
        "request_id": req_id,
        "attempts": attempts,
        "throttled": throttled,
        "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
        "total_backoff_ms": round(total_backoff_s * 1000, 1),
        "error": last_error or "retries_exhausted",
    }

Full demo output: calm-mode baseline
Captured on LocalEmu v0.1.dev133 with no chaos flags set.
[14:00:33] Clearing any previous run state (best effort)
✓ Clean slate
== 1. Create DynamoDB table ==
[14:00:34] Creating table chaos-writes
✓ Table chaos-writes ACTIVE
== 2. Create Lambda role ==
[14:00:35] Creating role chaos-lambda-role
✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role
== 3. Package + create the resilient writer Lambda ==
[14:00:35] Zip built
✓ Lambda chaos-resilient-writer Active
✓ Setup complete. Ids in ./.state/ids.env
== 4. Load test: 50 requests, 10-way concurrent ==
[14:00:36] LocalEmu chaos config:
no chaos flags detected (LocalEmu is in calm-mode)
[14:00:36] Firing 50 invocations…
✓ All invocations returned in 7s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 0
sum(attempts): 50 (= 50 means zero retries needed)
latency p50: 21 ms
latency p95: 75 ms
latency p99: 103 ms
== 6. Zero-data-loss assertion ==
[14:00:43] DynamoDB table holds 50 unique items
✓ Zero data loss: 50/50 requests landed in DDB
→ demo complete. Run scripts/teardown.sh when done.

Full demo output: 30% DynamoDB throttling
Same demo, same load, restarted LocalEmu with SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3.
[14:02:26] Clearing any previous run state (best effort)
✓ Clean slate
== 1. Create DynamoDB table ==
[14:02:28] Creating table chaos-writes
✓ Table chaos-writes ACTIVE
== 2. Create Lambda role ==
[14:02:28] Creating role chaos-lambda-role
✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role
== 3. Package + create the resilient writer Lambda ==
[14:02:29] Zip built
✓ Lambda chaos-resilient-writer Active
✓ Setup complete. Ids in ./.state/ids.env
== 4. Load test: 50 requests, 10-way concurrent ==
[14:02:35] LocalEmu chaos config:
DYNAMODB_THROTTLE_RATE=0.3
SIMULATE_THROTTLING=1
[14:02:35] Firing 50 invocations…
✓ All invocations returned in 7s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 26
sum(attempts): 76 (= 50 means zero retries needed)
latency p50: 14 ms
latency p95: 513 ms
latency p99: 980 ms
== 6. Zero-data-loss assertion ==
[14:02:43] DynamoDB table holds 50 unique items
✓ Zero data loss: 50/50 requests landed in DDB
→ demo complete. Run scripts/teardown.sh when done.

Full demo output: 300 ms fixed latency
Same demo, same load, restarted LocalEmu with SIMULATE_LATENCY=300.
[14:04:07] Clearing any previous run state (best effort)
✓ Clean slate
== 1. Create DynamoDB table ==
[14:04:09] Creating table chaos-writes
✓ Table chaos-writes ACTIVE
== 2. Create Lambda role ==
[14:04:11] Creating role chaos-lambda-role
✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role
== 3. Package + create the resilient writer Lambda ==
[14:04:12] Zip built
✓ Lambda chaos-resilient-writer Active
✓ Setup complete. Ids in ./.state/ids.env
== 4. Load test: 50 requests, 10-way concurrent ==
[14:04:19] LocalEmu chaos config:
SIMULATE_LATENCY=300
[14:04:19] Firing 50 invocations…
✓ All invocations returned in 12s wall-clock
== 5. Summary ==
requests: 50
ok: 50
failed: 0
throttle events: 0
sum(attempts): 50 (= 50 means zero retries needed)
latency p50: 318 ms
latency p95: 348 ms
latency p99: 397 ms
== 6. Zero-data-loss assertion ==
[14:04:31] DynamoDB table holds 50 unique items
✓ Zero data loss: 50/50 requests landed in DDB
→ demo complete. Run scripts/teardown.sh when done.

Files
Repository layout.
21-chaos-resilience/
├── README.md
├── scripts/
│   ├── demo.sh
│   └── teardown.sh
├── lib/
│   └── common.sh        (awsx, detect_chaos helper)
├── src/
│   └── handler.py       (idempotent PutItem with bounded retry + jitter)
└── infra/
    ├── 01_setup.sh      (DDB table, IAM role, Lambda)
    └── 02_load_test.sh  (N concurrent invokes + zero-loss assertion)