15. Resiliency
Design for failure: timeouts, retries with jitter, circuit breakers, and backpressure.
Question: How would you implement a resilient network client that can handle transient failures?
Answer: A resilient client should implement retries with exponential backoff and jitter. For asynchronous code, timeouts are also critical to prevent requests from hanging indefinitely.
Explanation:
Exponential Backoff with Jitter: Instead of retrying immediately, wait for a delay that increases exponentially with each failed attempt (e.g., 1s, 2s, 4s). Adding a small, random amount of time (jitter) prevents a "thundering herd" problem where many clients retry at the exact same time.
Timeouts: Always set a timeout on network requests so a stalled call cannot tie up resources indefinitely. In asyncio, asyncio.wait_for is a common way to achieve this.
import asyncio
import random

async def resilient_call(max_attempts=5):
    for attempt in range(max_attempts):
        try:
            # Add a timeout to the I/O operation
            return await asyncio.wait_for(do_io(), timeout=5)  # do_io represents the actual network call
        except Exception as e:
            # Don't retry on all errors (e.g., 4xx client errors)
            if not is_retryable(e):  # is_retryable checks if the exception is transient
                raise
            # Out of attempts: fail now instead of sleeping first, and keep the cause
            if attempt == max_attempts - 1:
                raise RuntimeError("Operation failed after multiple retries") from e
            # Backoff with jitter: 1s, 2s, 4s, ... plus up to 1s of random delay
            delay = (2 ** attempt) + random.random()
            await asyncio.sleep(delay)
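The retry loop above leaves do_io and is_retryable undefined. The following is a minimal, self-contained sketch of how those pieces might fit together, not a definitive implementation. TransientError, backoff_delay, call_with_retries, and make_flaky are hypothetical names introduced for illustration. The cap on the delay is an addition that the original does not mention, and the demo shrinks the base delay so it finishes quickly.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a connection reset)."""


def is_retryable(exc: Exception) -> bool:
    # Assumption: timeouts and TransientError are transient; anything else is not.
    return isinstance(exc, (asyncio.TimeoutError, TransientError))


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Exponential growth (base, 2*base, 4*base, ...) capped at `cap`,
    # plus up to one `base` of random jitter.
    return min(cap, base * 2 ** attempt) + random.uniform(0, base)


async def call_with_retries(fn, *, attempts=5, timeout=5.0, base=1.0):
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(fn(), timeout=timeout)
        except Exception as exc:
            # Give up on non-transient errors or when attempts are exhausted.
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            await asyncio.sleep(backoff_delay(attempt, base=base))


def make_flaky(failures: int):
    # Build a coroutine function that fails `failures` times, then succeeds.
    calls = {"n": 0}

    async def flaky_io():
        calls["n"] += 1
        if calls["n"] <= failures:
            raise TransientError("simulated transient failure")
        return "ok"

    return flaky_io, calls


async def demo():
    flaky_io, calls = make_flaky(failures=2)
    result = await call_with_retries(flaky_io, base=0.01)
    return result, calls["n"]


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ('ok', 3): two failures, then success
```

Capping the delay keeps late attempts from waiting for minutes, and re-raising the last exception preserves the real cause for the caller.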
Question: What is a circuit breaker and when should you use it?
Answer: A circuit breaker stops calls to an unhealthy dependency after repeated failures, then half-opens to probe recovery.
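Explanation: Failing fast protects your system from cascading failures and reduces load on the failing service while it recovers. Use a circuit breaker around calls to a remote dependency whose failures tend to persist, where continued retries would only add load.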
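import time

class CircuitBreaker:
    def __init__(self, fail_threshold=5, reset_timeout=30):
        self.failures = 0
        self.open_until = 0.0  # monotonic timestamp until which calls are rejected
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open after tripping

    async def call(self, fn, *a, **kw):
        # Open: fail fast without touching the dependency
        if self.open_until > time.monotonic():
            raise RuntimeError("circuit open")
        # Closed, or half-open once reset_timeout has elapsed: let the call through
        try:
            res = await fn(*a, **kw)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                # Trip the breaker (a failed half-open probe re-trips it immediately)
                self.open_until = time.monotonic() + self.reset_timeout
            raise
        # Success closes the circuit and resets the failure count
        self.failures = 0
        return res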
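To make the state transitions concrete, here is a hedged sketch that drives a breaker with the same logic through closed, open, and half-open. DemoBreaker, CircuitOpenError, the injectable clock parameter, and the dependency coroutine are all assumptions introduced for illustration. The fake clock lets the demo advance time without sleeping.

```python
import asyncio
import time


class CircuitOpenError(RuntimeError):
    """Hypothetical error raised when a call is rejected by an open circuit."""


class DemoBreaker:
    # Same logic as the breaker above; `clock` is injectable (an assumption)
    # so the demo can move time forward without sleeping.
    def __init__(self, fail_threshold=2, reset_timeout=30, clock=time.monotonic):
        self.failures = 0
        self.open_until = 0.0
        self.fail_threshold = fail_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock

    async def call(self, fn, *args, **kwargs):
        if self.open_until > self.clock():
            raise CircuitOpenError("circuit open")
        try:
            result = await fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.fail_threshold:
                self.open_until = self.clock() + self.reset_timeout
            raise
        self.failures = 0
        return result


async def demo():
    now = [0.0]        # fake clock, in seconds
    healthy = [False]  # toggles the simulated dependency

    async def dependency():
        if not healthy[0]:
            raise ConnectionError("dependency down")
        return "ok"

    breaker = DemoBreaker(fail_threshold=2, reset_timeout=30, clock=lambda: now[0])
    events = []
    for _ in range(3):
        try:
            await breaker.call(dependency)
        except CircuitOpenError:
            events.append("rejected")  # breaker is open: dependency not called
        except ConnectionError:
            events.append("failed")    # call reached the dependency and failed

    # After reset_timeout the breaker half-opens; a successful probe closes it.
    now[0] = 31.0
    healthy[0] = True
    events.append(await breaker.call(dependency))
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))  # ['failed', 'failed', 'rejected', 'ok']
```

The third call never reaches the dependency, which is the load reduction the explanation describes. A dedicated exception type also lets callers tell a rejected call apart from a real failure.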