15. Resiliency

Design for failure: timeouts, retries with jitter, circuit breakers, and backpressure.

Question: How would you implement a resilient network client that can handle transient failures?

Answer: A resilient client bounds every request with a timeout and retries transient failures using exponential backoff with jitter. It also caps the number of attempts and does not retry errors that will fail the same way again, such as 4xx client errors.

Explanation:

  • Exponential Backoff with Jitter: Instead of retrying immediately, wait for a delay that increases exponentially with each failed attempt (e.g., 1s, 2s, 4s). Adding a small, random amount of time (jitter) prevents a "thundering herd" problem where many clients retry at the exact same time.

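  • Timeouts: Always set a timeout on network requests so that a stalled call cannot tie up connections, tasks, or memory indefinitely. In asyncio, asyncio.wait_for is a common way to achieve this: when the deadline passes, it cancels the awaited operation and raises a timeout error.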
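The timeout behavior described above can be seen in isolation with a minimal sketch. Here slow_operation and call_with_timeout are hypothetical names, and asyncio.sleep stands in for a real network call:

```python
import asyncio

async def slow_operation(delay: float) -> str:
    # Hypothetical stand-in for a network call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return "done"

async def call_with_timeout(delay: float, timeout: float) -> str:
    try:
        return await asyncio.wait_for(slow_operation(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled slow_operation at this point.
        return "timed out"

print(asyncio.run(call_with_timeout(0.01, 0.5)))  # the call finishes in time
print(asyncio.run(call_with_timeout(0.5, 0.01)))  # the call is cut off
```

In real code the except branch would usually re-raise or hand the error to a retry policy rather than return a string.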

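import asyncio
import random

MAX_ATTEMPTS = 5

async def resilient_call():
    # do_io() is the actual network call; is_retryable() reports whether an
    # exception is transient (it should treat timeouts as transient).
    # Both are placeholders supplied by the caller.
    for attempt in range(MAX_ATTEMPTS):
        try:
            # Bound the I/O operation so it cannot hang indefinitely
            return await asyncio.wait_for(do_io(), timeout=5)
        except Exception as e:
            # Don't retry on all errors (e.g., 4xx client errors)
            if not is_retryable(e):
                raise
            # Out of attempts: fail now instead of sleeping first
            if attempt == MAX_ATTEMPTS - 1:
                raise RuntimeError("Operation failed after multiple retries") from e
            # Backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of random delay
            delay = (2 ** attempt) + random.random()
            await asyncio.sleep(delay)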
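To try the retry pattern end to end, here is a self-contained sketch under stated assumptions. TransientError, make_flaky_call, backoff_delay, and call_with_retries are hypothetical names introduced for illustration. It also swaps in a capped "full jitter" delay (uniform between 0 and the exponential ceiling), which is one common variant rather than the only valid choice, and uses tiny delays so the demo runs quickly:

```python
import asyncio
import random

class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""

def backoff_delay(attempt: int, base: float = 0.01, cap: float = 0.1) -> float:
    # Exponential growth, capped, then full jitter: uniform in [0, ceiling].
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

def make_flaky_call(failures_before_success: int):
    # Build a coroutine function that fails a fixed number of times, then succeeds.
    state = {"calls": 0}

    async def flaky_call() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError(f"attempt {state['calls']} failed")
        return f"ok after {state['calls']} calls"

    return flaky_call

async def call_with_retries(fn, max_attempts: int = 5, timeout: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout)
        except (TransientError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(backoff_delay(attempt))

print(asyncio.run(call_with_retries(make_flaky_call(2))))  # ok after 3 calls
```

Catching only the error types known to be transient plays the role of is_retryable here: any other exception propagates immediately without a retry.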

Question: What is a circuit breaker and when should you use it?

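Answer: A circuit breaker tracks failures of calls to a dependency. After a threshold of consecutive failures it opens and rejects further calls immediately. After a cool-down period it half-opens, letting a trial call through to probe whether the dependency has recovered: a success closes the circuit, a failure re-opens it. Use it around calls to a remote dependency that can stay unhealthy for longer than a retry loop can reasonably wait out.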

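Explanation: Failing fast protects your system from cascading failures, because callers stop spending connections, tasks, and time on requests that are likely to fail. It also reduces load on the failing service while it recovers.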

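import time

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=30):
        self.failures = 0                     # consecutive failures so far
        self.open_until = 0.0                 # monotonic time until which calls are rejected
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout    # seconds to stay open before probing

    async def call(self, fn, *a, **kw):
        # Use a monotonic clock so wall-clock adjustments cannot affect the breaker
        if self.open_until > time.monotonic():
            raise RuntimeError("circuit open")
        try:
            res = await fn(*a, **kw)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                # Open the circuit, timed from the failure. The count is not reset
                # here, so a single failed probe after the cool-down re-opens it.
                self.open_until = time.monotonic() + self.reset_timeout
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        return res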
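The closed, open, and half-open states can also be made explicit. The following is a sketch rather than a definitive implementation: StatefulBreaker, CircuitOpenError, and the injectable clock parameter are assumptions added for illustration, and the half-open state here does not limit how many concurrent probes get through.

```python
import asyncio
import time

class CircuitOpenError(Exception):
    """Hypothetical error raised when the breaker rejects a call."""

class StatefulBreaker:
    def __init__(self, fail_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable so a test can control time
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    async def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit open")
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or too many failures while closed, (re)opens the circuit.
            if state == "half-open" or self.failures >= self.fail_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result

async def demo():
    now = [0.0]  # fake clock, advanced manually
    breaker = StatefulBreaker(fail_threshold=2, reset_timeout=10.0, clock=lambda: now[0])

    async def failing():
        raise ConnectionError("dependency down")

    async def healthy():
        return "ok"

    states = []
    for _ in range(2):
        try:
            await breaker.call(failing)
        except ConnectionError:
            pass
    states.append(breaker.state)   # threshold reached
    now[0] = 11.0                  # cool-down has elapsed
    states.append(breaker.state)
    await breaker.call(healthy)    # successful probe
    states.append(breaker.state)
    return states

print(asyncio.run(demo()))  # ['open', 'half-open', 'closed']
```

Injecting the clock keeps the demo deterministic; production code would leave the default time.monotonic in place.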