Chaos Resilience Drill

Everyone claims their retry logic works. Almost nobody can show it. With LocalEmu's throttling and latency simulation, you can demonstrate in the inner loop that a retry / backoff / latency-budget envelope holds, with no production incident required. Same Lambda, same load, three different LocalEmu configs, three different stories. All three finish with zero data loss.

What the demo does. It provisions a DynamoDB table, an IAM role with dynamodb:PutItem scoped to that one table, and a Lambda whose handler does one idempotent PutItem with bounded exponential-backoff retry on every AWS throttle code (MAX_ATTEMPTS=5, so at most four backoff sleeps of 100, 200, 400, and 800 ms, each with 0 to 50% jitter). It then fires 50 Lambda invocations 10-way concurrent via xargs -P, captures each response, and rolls up ok, failed, throttle events, sum(attempts), and latency p50/p95/p99. Finally, it scans the table to assert COUNT == 50 (zero data loss). Run the same demo under three LocalEmu configs (calm-mode; SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3; SIMULATE_LATENCY=300) to see the envelope move. Source: 21-chaos-resilience/ in the examples repo.

🧯

Real AWS error codes

DynamoDB returns ProvisionedThroughputExceededException; S3 returns SlowDown. Same error codes, same HTTP status codes, same retry surface as production.

🔬

Observable retries

boto3's auto-retry layer is disabled. The handler's own retry loop is the only one, so reported attempts/throttles are the actual counts.

🛟

Zero-loss assertion

After each run, the driver scans DynamoDB and asserts COUNT == REQUESTS. The retry envelope either held or it did not, and the database row count says which.
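The zero-loss check can be sketched in Python as follows. This is an illustrative alternative to the demo's shell driver, not its actual code: the function names are hypothetical, and the request_id attribute name is taken from the item the handler writes. The client argument is expected to behave like a boto3 DynamoDB client, whose scan call pages through results via LastEvaluatedKey.

```python
def count_unique_request_ids(ddb_client, table_name):
    """Scan the whole table and count distinct request_id values.

    ddb_client is assumed to expose scan() like a boto3 DynamoDB client.
    """
    seen = set()
    kwargs = {"TableName": table_name, "ProjectionExpression": "request_id"}
    while True:
        page = ddb_client.scan(**kwargs)
        for item in page.get("Items", []):
            seen.add(item["request_id"]["S"])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            # No more pages: the scan has covered the full table.
            return len(seen)
        kwargs["ExclusiveStartKey"] = last_key


def assert_zero_loss(ddb_client, table_name, expected):
    """Raise AssertionError unless exactly `expected` unique items landed."""
    landed = count_unique_request_ids(ddb_client, table_name)
    if landed != expected:
        raise AssertionError(f"data loss: {landed}/{expected} requests landed")
    return landed
```

In a real run you would pass a boto3 DynamoDB client configured for your LocalEmu endpoint, together with the table name and the number of requests fired. Counting distinct keys rather than raw items keeps the check meaningful even if a retried write lands twice under the same request_id.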

Step-by-Step Walkthrough

Step 1: The resilient writer

# src/handler.py (excerpt), explicit, observable retry loop.
# boto3 auto-retry disabled (Config(retries={"max_attempts":0}))
# so the counts the driver reports are the actual ones.
_THROTTLE_CODES = {
    "ProvisionedThroughputExceededException", "ThrottlingException",
    "Throttling", "RequestLimitExceeded", "TooManyRequestsException",
    "SlowDown", "OverLimit", "LimitExceededException",
}

while attempts < MAX_ATTEMPTS:
    attempts += 1
    try:
        _ddb.put_item(TableName=TABLE, Item=...)
        return {"ok": True, "attempts": attempts, "throttled": throttled, ...}
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code not in _THROTTLE_CODES:
            return {"ok": False, "error": code, ...}          # surface non-throttle errors
        throttled += 1
        if attempts < MAX_ATTEMPTS:
            time.sleep(0.1 * (2 ** (attempts - 1)) * (1 + random.random() * 0.5))

Backoff sequence: 100 ms, 200 ms, 400 ms, 800 ms, each with 0 to 50% additive jitter. With MAX_ATTEMPTS=5 the loop sleeps at most four times, so the 1600 ms step is only reached if MAX_ATTEMPTS is raised above 5. Non-throttle errors are surfaced immediately: the demo measures resilience to throttling specifically, not blanket retry.
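To make the backoff envelope concrete, here is a minimal sketch that enumerates the per-sleep bounds for the formula in the excerpt. The function name backoff_bounds is hypothetical; the base delay, jitter fraction, and attempt limit mirror the handler's values.

```python
def backoff_bounds(max_attempts=5, base_s=0.1, jitter_frac=0.5):
    """Return (min_delays, max_delays) in seconds for the sleeps between attempts.

    A sleep happens only after a throttled attempt that is not the last one,
    so there are at most max_attempts - 1 sleeps.
    """
    min_delays = [base_s * (2 ** i) for i in range(max_attempts - 1)]
    # random.random() is in [0, 1), so the upper bound is approached, not reached.
    max_delays = [d * (1 + jitter_frac) for d in min_delays]
    return min_delays, max_delays


lo, hi = backoff_bounds()
print("min per sleep (ms):", [round(d * 1000) for d in lo])
print("max per sleep (ms):", [round(d * 1000) for d in hi])
print("total backoff range (s):", round(sum(lo), 3), "to", round(sum(hi), 3))
```

With the defaults, a request that is throttled on every attempt spends between 1.5 s and just under 2.25 s sleeping, on top of the time taken by the five PutItem calls themselves.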

Step 2: Calm-mode baseline

$ localemu start                                  # calm-mode baseline

$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
  LocalEmu chaos config:
      no chaos flags detected (LocalEmu is in calm-mode)
  Firing 50 invocations…
  ✓ All invocations returned in 7s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 0
      sum(attempts):   50  (= 50 means zero retries needed)
      latency p50:     21 ms
      latency p95:     75 ms
      latency p99:     103 ms

== 6. Zero-data-loss assertion ==
  ✓ Zero data loss: 50/50 requests landed in DDB

All 50 requests land with zero retries and a p95 of 75 ms. This is your control run; everything below is measured against this baseline.

Step 3: 30% DynamoDB throttle

$ SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3 \
     localemu start                              # inject 30% throttling on DDB

$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
  LocalEmu chaos config:
      DYNAMODB_THROTTLE_RATE=0.3
      SIMULATE_THROTTLING=1
  Firing 50 invocations…
  ✓ All invocations returned in 7s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 26
      sum(attempts):   76  (= 50 means zero retries needed)
      latency p50:     14 ms
      latency p95:     513 ms
      latency p99:     980 ms

== 6. Zero-data-loss assertion ==
  ✓ Zero data loss: 50/50 requests landed in DDB

LocalEmu rejects about 30% of PutItem calls with the real AWS error code. The retry loop fires 26 extra attempts to land the 50 items (sum(attempts) = 76). p95 latency jumps from 75 ms to 513 ms because of the backoff sleeps. Every item still lands, zero data loss.
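As a rough sanity check on these numbers, the sketch below computes what the retry loop should see on average for a given throttle rate. It assumes each PutItem call is throttled independently with probability p, which is an assumption about how the emulator injects throttles rather than something the demo output confirms. The function name retry_expectations is hypothetical.

```python
def retry_expectations(p, max_attempts=5, requests=50):
    """Expected retry-loop behaviour when each call is throttled with probability p.

    Assumes throttling decisions are independent per call.
    """
    # A request fails only if every one of its attempts is throttled.
    p_exhaust = p ** max_attempts
    # Attempt k+1 happens only if the first k attempts were all throttled.
    attempts_per_request = sum(p ** k for k in range(max_attempts))
    # The k-th throttle happens only if the first k attempts were all throttled.
    throttles_per_request = sum(p ** k for k in range(1, max_attempts + 1))
    return {
        "expected_attempts": requests * attempts_per_request,
        "expected_throttles": requests * throttles_per_request,
        "p_exhaust": p_exhaust,
        "p_all_land": (1 - p_exhaust) ** requests,
    }


stats = retry_expectations(0.3)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For p = 0.3 this gives roughly 71 expected attempts and 21 expected throttle events across 50 requests, in the same range as the 76 and 26 observed above. Under the same assumption, a single request exhausts all five attempts with probability of about 0.24%, so a 50-request run lands every item roughly 88.5% of the time. An occasional run with one failed request would therefore not by itself indicate a bug in the retry loop.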

Step 4: 300 ms fixed latency

$ SIMULATE_LATENCY=300 localemu start             # add 300 ms to every API response

$ ./scripts/demo.sh
== 4. Load test: 50 requests, 10-way concurrent ==
  LocalEmu chaos config:
      SIMULATE_LATENCY=300
  Firing 50 invocations…
  ✓ All invocations returned in 12s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 0
      sum(attempts):   50  (= 50 means zero retries needed)
      latency p50:     318 ms
      latency p95:     348 ms
      latency p99:     397 ms

== 6. Zero-data-loss assertion ==
  ✓ Zero data loss: 50/50 requests landed in DDB

No throttles, no retries, but every request pays the latency. p50, p95, and p99 sit between 318 and 397 ms, a tight spread just above the injected 300 ms. This is the right scenario to use when sizing Lambda timeouts and API Gateway integration timeouts.
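One way to turn these measurements into a timeout figure is to bound the handler's worst case: every attempt pays the per-call latency, and every attempt except the last is followed by a maximum-jitter sleep. The sketch below does that arithmetic. The function names and the headroom factor are hypothetical, and the bound ignores cold starts and any client-side connect or read timeouts.

```python
import math


def worst_case_elapsed_s(per_call_s, max_attempts=5, base_s=0.1, jitter_frac=0.5):
    """Upper bound on handler time if every attempt is throttled.

    per_call_s is the assumed worst-case duration of one PutItem call.
    """
    max_backoff = sum(
        base_s * (2 ** i) * (1 + jitter_frac) for i in range(max_attempts - 1)
    )
    return max_attempts * per_call_s + max_backoff


def suggested_timeout_s(per_call_s, headroom=1.5, **retry_params):
    """Round the worst case up to whole seconds after applying a safety factor."""
    return math.ceil(worst_case_elapsed_s(per_call_s, **retry_params) * headroom)


# Using roughly the p99 from the 300 ms latency run as the per-call figure.
print(worst_case_elapsed_s(0.4))
print(suggested_timeout_s(0.4))
```

With a 0.4 s per-call figure the worst case comes to about 4.25 s (2.0 s of calls plus 2.25 s of backoff), so a Lambda timeout of only a few seconds would cut off a request on its final retries even though the retry loop itself is behaving correctly.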

What the run proves

LocalEmu config	ok	throttle events	retries	p50	p95	p99
calm-mode	50/50	0	0	21 ms	75 ms	103 ms
DYNAMODB_THROTTLE_RATE=0.3	50/50	26	26	14 ms	513 ms	980 ms
SIMULATE_LATENCY=300	50/50	0	0	318 ms	348 ms	397 ms
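The columns in this table can be derived from the per-invocation envelopes the handler returns. The sketch below is one way to do that roll-up in Python. The demo's own driver is a shell script, so the function names here are hypothetical, and both the nearest-rank percentile method and the use of the handler's elapsed_ms as the latency source are assumptions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(responses):
    """Roll up handler envelopes (ok, attempts, throttled, elapsed_ms)."""
    latencies = [r["elapsed_ms"] for r in responses]
    total_attempts = sum(r["attempts"] for r in responses)
    return {
        "requests": len(responses),
        "ok": sum(1 for r in responses if r["ok"]),
        "failed": sum(1 for r in responses if not r["ok"]),
        "throttle_events": sum(r["throttled"] for r in responses),
        "sum_attempts": total_attempts,
        # Every attempt beyond the first one per request is a retry.
        "retries": total_attempts - len(responses),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
    }


sample = [
    {"ok": True, "attempts": 1, "throttled": 0, "elapsed_ms": 12.0},
    {"ok": True, "attempts": 2, "throttled": 1, "elapsed_ms": 140.5},
    {"ok": True, "attempts": 1, "throttled": 0, "elapsed_ms": 15.3},
]
print(summarize(sample))
```

When no request exhausts its attempts, retries and throttle events are equal, which is why both columns read 26 in the throttled run.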

Full source: src/handler.py

The resilient writer. One PutItem per invocation with bounded exponential backoff on every AWS throttle code. boto3's auto-retry is explicitly disabled (Config(retries={'max_attempts': 0})) so the reported attempts and throttle counts are the actual ones.

"""
Resilient writer Lambda. Idempotent put_item with exponential backoff
retry on throttle (ProvisionedThroughputExceededException), bounded
attempts, jitter. Returns a structured envelope the load driver can
roll up into success / retry / failure counts.

The handler is explicit about what it survived: it does NOT swallow
throttles silently. Every throttle that triggers a retry is counted;
a request that exhausts its retries returns ok=false so the driver
can report it as a real failure.
"""

import os
import random
import time

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError


TABLE = os.environ["TABLE_NAME"]
MAX_ATTEMPTS = int(os.environ.get("MAX_ATTEMPTS", "5"))

# Disable boto3's own retry layer so OUR retry loop is the only one;
# otherwise we'd double-count and the timing data would be misleading.
_ddb = boto3.client("dynamodb", config=Config(retries={"max_attempts": 0}))

_THROTTLE_CODES = {
    "ProvisionedThroughputExceededException",
    "ThrottlingException",
    "Throttling",
    "RequestLimitExceeded",
    "TooManyRequestsException",
    "SlowDown",
    "OverLimit",
    "LimitExceededException",
}


def _sleep_with_jitter(attempt: int) -> float:
    """Exponential backoff 100ms -> 200ms -> 400ms -> 800ms -> 1600ms,
    plus 0-50% additive jitter. Returns the actual seconds slept."""
    base = 0.1 * (2 ** attempt)
    jitter = base * random.random() * 0.5
    delay = base + jitter
    time.sleep(delay)
    return delay


def handler(event, _ctx):
    req_id = event["request_id"]
    payload = event.get("payload", "default")

    attempts = 0
    throttled = 0
    total_backoff_s = 0.0
    start = time.monotonic()
    last_error = None

    while attempts < MAX_ATTEMPTS:
        attempts += 1
        try:
            _ddb.put_item(
                TableName=TABLE,
                Item={
                    "request_id": {"S": req_id},
                    "payload": {"S": payload},
                    "attempts_to_land": {"N": str(attempts)},
                },
            )
            return {
                "ok": True,
                "request_id": req_id,
                "attempts": attempts,
                "throttled": throttled,
                "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
                "total_backoff_ms": round(total_backoff_s * 1000, 1),
            }
        except ClientError as e:
            code = e.response.get("Error", {}).get("Code", "")
            last_error = code
            if code not in _THROTTLE_CODES:
                # Non-throttle error: surface it, don't retry blindly.
                return {
                    "ok": False,
                    "request_id": req_id,
                    "attempts": attempts,
                    "throttled": throttled,
                    "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
                    "error": code,
                }
            throttled += 1
            if attempts >= MAX_ATTEMPTS:
                break
            total_backoff_s += _sleep_with_jitter(attempts - 1)

    return {
        "ok": False,
        "request_id": req_id,
        "attempts": attempts,
        "throttled": throttled,
        "elapsed_ms": round((time.monotonic() - start) * 1000, 1),
        "total_backoff_ms": round(total_backoff_s * 1000, 1),
        "error": last_error or "retries_exhausted",
    }

Full demo output: calm-mode baseline

Captured on LocalEmu v0.1.dev133 with no chaos flags set.

[14:00:33] Clearing any previous run state (best effort)
  ✓ Clean slate

== 1. Create DynamoDB table ==
[14:00:34] Creating table chaos-writes
  ✓ Table chaos-writes ACTIVE

== 2. Create Lambda role ==
[14:00:35] Creating role chaos-lambda-role
  ✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role

== 3. Package + create the resilient writer Lambda ==
[14:00:35] Zip built
  ✓ Lambda chaos-resilient-writer Active
  ✓ Setup complete. Ids in ./.state/ids.env

== 4. Load test: 50 requests, 10-way concurrent ==
[14:00:36] LocalEmu chaos config:
      no chaos flags detected (LocalEmu is in calm-mode)
[14:00:36] Firing 50 invocations…
  ✓ All invocations returned in 7s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 0
      sum(attempts):   50  (= 50 means zero retries needed)
      latency p50:     21 ms
      latency p95:     75 ms
      latency p99:     103 ms

== 6. Zero-data-loss assertion ==
[14:00:43] DynamoDB table holds 50 unique items
  ✓ Zero data loss: 50/50 requests landed in DDB

→ demo complete. Run scripts/teardown.sh when done.

Full demo output: 30% DynamoDB throttling

Same demo, same load, restarted LocalEmu with SIMULATE_THROTTLING=1 DYNAMODB_THROTTLE_RATE=0.3.

[14:02:26] Clearing any previous run state (best effort)
  ✓ Clean slate

== 1. Create DynamoDB table ==
[14:02:28] Creating table chaos-writes
  ✓ Table chaos-writes ACTIVE

== 2. Create Lambda role ==
[14:02:28] Creating role chaos-lambda-role
  ✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role

== 3. Package + create the resilient writer Lambda ==
[14:02:29] Zip built
  ✓ Lambda chaos-resilient-writer Active
  ✓ Setup complete. Ids in ./.state/ids.env

== 4. Load test: 50 requests, 10-way concurrent ==
[14:02:35] LocalEmu chaos config:
      DYNAMODB_THROTTLE_RATE=0.3
      SIMULATE_THROTTLING=1
[14:02:35] Firing 50 invocations…
  ✓ All invocations returned in 7s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 26
      sum(attempts):   76  (= 50 means zero retries needed)
      latency p50:     14 ms
      latency p95:     513 ms
      latency p99:     980 ms

== 6. Zero-data-loss assertion ==
[14:02:43] DynamoDB table holds 50 unique items
  ✓ Zero data loss: 50/50 requests landed in DDB

→ demo complete. Run scripts/teardown.sh when done.

Full demo output: 300 ms fixed latency

Same demo, same load, restarted LocalEmu with SIMULATE_LATENCY=300.

[14:04:07] Clearing any previous run state (best effort)
  ✓ Clean slate

== 1. Create DynamoDB table ==
[14:04:09] Creating table chaos-writes
  ✓ Table chaos-writes ACTIVE

== 2. Create Lambda role ==
[14:04:11] Creating role chaos-lambda-role
  ✓ Role arn:aws:iam::000000000000:role/chaos-lambda-role

== 3. Package + create the resilient writer Lambda ==
[14:04:12] Zip built
  ✓ Lambda chaos-resilient-writer Active
  ✓ Setup complete. Ids in ./.state/ids.env

== 4. Load test: 50 requests, 10-way concurrent ==
[14:04:19] LocalEmu chaos config:
      SIMULATE_LATENCY=300
[14:04:19] Firing 50 invocations…
  ✓ All invocations returned in 12s wall-clock

== 5. Summary ==
      requests:        50
      ok:              50
      failed:          0
      throttle events: 0
      sum(attempts):   50  (= 50 means zero retries needed)
      latency p50:     318 ms
      latency p95:     348 ms
      latency p99:     397 ms

== 6. Zero-data-loss assertion ==
[14:04:31] DynamoDB table holds 50 unique items
  ✓ Zero data loss: 50/50 requests landed in DDB

→ demo complete. Run scripts/teardown.sh when done.

Files

Repository layout.

21-chaos-resilience/
├── README.md
├── scripts/
│   ├── demo.sh
│   └── teardown.sh
├── lib/
│   └── common.sh         (awsx, detect_chaos helper)
├── src/
│   └── handler.py        (idempotent PutItem with bounded retry + jitter)
└── infra/
    ├── 01_setup.sh       (DDB table, IAM role, Lambda)
    └── 02_load_test.sh   (N concurrent invokes + zero-loss assertion)
