Reliability and Interview Walkthroughs

Apply tracing, chaos engineering, error budgets, canaries, and full design walkthroughs.

How to Use This Lesson

Start with the user problem, then map the pattern to architecture and failure modes.
If a code or design example is included, change one assumption and reason through the impact.
Use role callouts, checklists, and Q&A sections as implementation or interview prep notes.

Intermediate interviews often ask you to move from component knowledge into an end-to-end design. Reliability is where vague designs break: what happens during deploys, partial outages, hot keys, duplicate messages, and dependency failures?

Tracing And Failure Testing

Distributed tracing gives each request a trace ID and records spans across services. A good trace for a checkout, notification, or LLM request shows gateway time, service time, cache time, database time, queue publish time, and external provider time.
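To make the span idea concrete, here is a minimal sketch of a trace recorder using only the Python standard library. It is not a real tracing SDK: the `Trace` class, the span names, and the `handle_checkout` function are hypothetical, and the sleeps stand in for real cache, database, and queue calls.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects spans for one request; the trace_id ties them together."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name):
        # Time the wrapped block and record it even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append(
                {"trace_id": self.trace_id, "name": name, "ms": elapsed_ms}
            )


def handle_checkout(trace):
    # Hypothetical request path: each dependency call gets its own span.
    with trace.span("gateway"):
        with trace.span("cache.get"):
            time.sleep(0.001)
        with trace.span("db.query"):
            time.sleep(0.002)
        with trace.span("queue.publish"):
            time.sleep(0.001)


trace = Trace()
handle_checkout(trace)
for s in trace.spans:
    print(f'{s["trace_id"][:8]} {s["name"]:<14} {s["ms"]:.2f} ms')
```

Inner spans finish first, so the enclosing gateway span is recorded last and its duration covers all of its children. That nesting is what lets you see which dependency dominates request time.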

Chaos engineering deliberately injects controlled failure to validate assumptions. Start small: kill one worker, add latency to Redis, make a provider return 500s, or pause a queue consumer. The point is not drama; it is proving fallback behavior before customers discover the failure mode.
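A small fault-injection wrapper shows the shape of such a test. This is a sketch under stated assumptions: `flaky`, `send_with_fallback`, and both provider functions are hypothetical names, and a real experiment would inject the failure at the network or infrastructure layer rather than in process.

```python
import random


class ProviderError(Exception):
    """Stands in for a provider returning HTTP 500."""


def flaky(call, failure_rate, rng):
    """Wrap `call` so that roughly `failure_rate` of invocations fail."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ProviderError("injected 500")
        return call(*args, **kwargs)
    return wrapped


def send_with_fallback(primary, fallback, message):
    # The assumption under test: a primary failure is absorbed by the fallback.
    try:
        return primary(message)
    except ProviderError:
        return fallback(message)


def primary_provider(message):
    return ("primary", message)


def fallback_provider(message):
    return ("fallback", message)


rng = random.Random(42)
always_failing = flaky(primary_provider, failure_rate=1.0, rng=rng)
never_failing = flaky(primary_provider, failure_rate=0.0, rng=rng)

result_during_outage = send_with_fallback(always_failing, fallback_provider, "hello")
result_when_healthy = send_with_fallback(never_failing, fallback_provider, "hello")
print(result_during_outage, result_when_healthy)
```

Starting with a failure rate of 1.0 against a single dependency keeps the experiment small and the expected outcome unambiguous.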

Error Budgets And Deployments

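Use the error budget from your SLO to govern release risk. The budget is the amount of unreliability the SLO tolerates: a 99.9% availability SLO allows 0.1% of requests in the measurement window to fail. While latency and availability are healthy and budget remains, ship normally. If the budget is nearly gone, freeze risky releases and invest in reliability.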
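The arithmetic behind that decision can be sketched as follows. The 99.9% target, the request counts, and the 10% freeze threshold are illustrative assumptions, not values from the lesson.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


def release_policy(remaining, freeze_below=0.10):
    # freeze_below is an assumed threshold for "nearly gone".
    return "freeze" if remaining < freeze_below else "ship"


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of budget left -> {release_policy(remaining)}")
```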

Blue-green deployment runs two complete environments and switches traffic from old to new. Rollback is fast because the old environment stays intact, but running two full environments at once costs more.

Canary deployment sends a small percentage of traffic to the new version, watches metrics, then ramps up. It limits the blast radius of a bad release, but it needs a sound way to segment traffic (for example by user, tenant, or region) and automated rollback.
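One way to picture the ramp-and-rollback loop is a function that decides the next traffic percentage from a single metric. The ramp steps, the error-rate comparison, and the tolerance are assumptions for illustration. A production controller would usually watch several metrics and wait for enough traffic before deciding.

```python
RAMP_STEPS = [1, 5, 25, 50, 100]  # assumed percentages of traffic


def next_canary_step(current_percent, canary_error_rate, baseline_error_rate,
                     tolerance=0.005):
    """Return the next traffic percentage, or 0 to signal rollback."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # canary is measurably worse than baseline: roll back
    for step in RAMP_STEPS:
        if step > current_percent:
            return step
    return 100  # already fully ramped


print(next_canary_step(5, canary_error_rate=0.002, baseline_error_rate=0.001))
print(next_canary_step(5, canary_error_rate=0.020, baseline_error_rate=0.001))
```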

Full Walkthrough: Rate Limiter

Requirements: enforce per-user, per-tenant, and per-IP limits for an API. Support free and paid tiers. Return clear retry information. Handle 100,000 requests per second globally.

Capacity: every request performs at least one limiter check. At 100,000 RPS, the limiter must be low latency and horizontally scalable. A synchronous database row update per request adds too much latency and write contention at this rate, so counters belong in an in-memory store.

Algorithms:

Algorithm	Use	Limitation
Fixed window	Simple counters	Boundary bursts
Sliding log	Exact request history	High memory
Sliding window counter	Good approximation	Slight inaccuracy
Token bucket	Average limit plus bursts	Needs careful refill math
Leaky bucket	Smooth outbound rate	Queues or rejects bursts
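The boundary-burst limitation of the fixed window, and the approximation made by the sliding window counter, can be seen in a small in-memory simulation. This is a sketch: the limit of 5 per 60 seconds and the request timestamps are made up, and the sliding counter uses the common weighting of the previous window by the fraction of it still inside the sliding window.

```python
def fixed_window_allows(timestamps, limit, window):
    counts, decisions = {}, []
    for t in timestamps:
        w = int(t // window)
        allowed = counts.get(w, 0) < limit
        if allowed:
            counts[w] = counts.get(w, 0) + 1
        decisions.append(allowed)
    return decisions


def sliding_counter_allows(timestamps, limit, window):
    counts, decisions = {}, []
    for t in timestamps:
        w = int(t // window)
        elapsed_fraction = (t - w * window) / window
        # Weight the previous window by how much of it still overlaps.
        estimate = counts.get(w - 1, 0) * (1 - elapsed_fraction) + counts.get(w, 0)
        allowed = estimate < limit
        if allowed:
            counts[w] = counts.get(w, 0) + 1
        decisions.append(allowed)
    return decisions


# Five requests just before the window boundary, five just after it.
burst = [59.0] * 5 + [61.0] * 5
fixed_allowed = sum(fixed_window_allows(burst, limit=5, window=60))
sliding_allowed = sum(sliding_counter_allows(burst, limit=5, window=60))
print(fixed_allowed, sliding_allowed)
```

With these inputs the fixed window admits all ten requests within two seconds, twice the nominal limit. The sliding counter admits six, which is closer to the limit but still not exact.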

Architecture: an L7 API gateway extracts principal, route, and tier. A rate limiter service uses Redis Cluster for counters and Lua scripts for atomic check-and-update. Configuration lives in a database and is cached locally. Decisions are logged asynchronously.

Redis key shape:

rl:{tenant_id}:{route}:{window_start}
rl:{ip}:{window_start}
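As an illustration of the key shapes above, the helpers below fill in the placeholders for a fixed 60-second window. The function names, the window length, and the example tenant and route are assumptions. The braces in the key shape are template placeholders, so the generated keys contain no literal braces.

```python
def window_start(now_seconds, window_seconds):
    """Round a Unix timestamp down to the start of its window."""
    now = int(now_seconds)
    return now - now % window_seconds


def tenant_key(tenant_id, route, now_seconds, window_seconds=60):
    return f"rl:{tenant_id}:{route}:{window_start(now_seconds, window_seconds)}"


def ip_key(ip, now_seconds, window_seconds=60):
    return f"rl:{ip}:{window_start(now_seconds, window_seconds)}"


print(tenant_key("acme", "/v1/generate", 1700000125))
print(ip_key("203.0.113.7", 1700000125))
```

Because the window start is part of the key, each window gets a fresh counter, and a TTL slightly longer than the window lets old keys expire on their own.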

For a token bucket, store current token count and last refill timestamp. The Lua script computes refill, checks availability, decrements tokens, sets TTL, and returns allowed plus retry delay.
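The refill math is easier to reason about in a plain in-memory version. The sketch below mirrors the steps the Lua script would perform, written in Python for readability. The `Bucket` class, the parameter names, and the capacity and refill values are assumptions, and the TTL step is omitted because nothing expires in process memory.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float       # current token count
    last_refill: float  # timestamp of the last refill, in seconds


def take_token(bucket, now, capacity, refill_per_sec, cost=1.0):
    """Refill, check, and decrement. Returns (allowed, retry_after_seconds)."""
    # Refill for the time elapsed since the last call, capped at capacity.
    elapsed = max(0.0, now - bucket.last_refill)
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_sec)
    bucket.last_refill = max(bucket.last_refill, now)

    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return True, 0.0

    # Not enough tokens: report how long until the deficit is refilled.
    retry_after = (cost - bucket.tokens) / refill_per_sec
    return False, retry_after


bucket = Bucket(tokens=2.0, last_refill=0.0)
results = [
    take_token(bucket, now=0.0, capacity=2, refill_per_sec=1.0),
    take_token(bucket, now=0.0, capacity=2, refill_per_sec=1.0),
    take_token(bucket, now=0.0, capacity=2, refill_per_sec=1.0),
    take_token(bucket, now=0.5, capacity=2, refill_per_sec=1.0),
    take_token(bucket, now=1.5, capacity=2, refill_per_sec=1.0),
]
print(results)
```

In Redis the same read-compute-write sequence has to run atomically, which is why the design puts it in a single script rather than separate GET and SET calls.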

Global scale: route a tenant consistently to a home region when strict global limits matter. For softer abuse limits, use regional limits plus asynchronous aggregation.
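Consistent home-region routing can be as simple as a stable hash of the tenant ID. The region names below are hypothetical. Plain modulo hashing reassigns many tenants when the region list changes, so a stored tenant-to-region mapping or a consistent-hashing ring may be a better fit in practice.

```python
import hashlib

REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical region names


def home_region(tenant_id, regions=REGIONS):
    """Map a tenant to the same region on every server and every call."""
    # hashlib is used instead of hash() because hash() is salted per process.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


print(home_region("acme"), home_region("globex"))
```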

Failure behavior: if Redis is slow or unreachable, use a local emergency limiter with small in-memory quotas for a few seconds. For expensive model-generation routes, fail closed or degrade to lower quotas. For low-cost metadata reads, fail open with alerting.
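A per-route policy table makes these choices explicit instead of leaving them to whatever the client library does on timeout. The route paths, the enum, and the function name below are hypothetical. The sketch only covers the decision, not the detection of a slow Redis.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"      # allow the request, raise an alert
    FAIL_CLOSED = "fail_closed"  # reject the request
    LOCAL_QUOTA = "local_quota"  # fall back to a small in-memory quota


# Assumed routes: expensive generation fails closed, cheap reads fail open.
ROUTE_POLICY = {
    "/v1/generate": FailureMode.FAIL_CLOSED,
    "/v1/metadata": FailureMode.FAIL_OPEN,
}


def allow_when_limiter_down(route, local_tokens_left):
    """Decide whether to admit a request while the shared limiter is unavailable."""
    mode = ROUTE_POLICY.get(route, FailureMode.LOCAL_QUOTA)
    if mode is FailureMode.FAIL_OPEN:
        return True
    if mode is FailureMode.FAIL_CLOSED:
        return False
    return local_tokens_left > 0


print(allow_when_limiter_down("/v1/generate", local_tokens_left=5))
print(allow_when_limiter_down("/v1/metadata", local_tokens_left=0))
print(allow_when_limiter_down("/v1/orders", local_tokens_left=3))
```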

Observability: track allow rate, block rate, Redis latency, script errors, hot keys, top blocked tenants, and false-positive support tickets.

Mini Walkthrough: Video Streaming

Requirements: start playback quickly, avoid buffering, support multiple bitrates, and keep origin traffic low.

Architecture: videos are transcoded into adaptive bitrate segments, stored in object storage, and distributed through CDN edges. Clients request manifests and switch bitrates based on bandwidth. Popular content is pre-positioned near users; rare content is pulled on demand.
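Client-side bitrate switching can be sketched as picking the highest rendition that fits within measured bandwidth. The bitrate ladder and the 0.8 safety factor are assumptions. Real players also consider buffer level and recent throughput variance.

```python
BITRATE_LADDER_KBPS = [400, 1200, 2500, 5000]  # hypothetical renditions


def pick_bitrate(measured_kbps, ladder=BITRATE_LADDER_KBPS, safety=0.8):
    """Choose the highest bitrate that fits within a fraction of bandwidth."""
    budget = measured_kbps * safety
    candidates = [b for b in ladder if b <= budget]
    # If even the lowest rendition exceeds the budget, still play the lowest.
    return max(candidates) if candidates else min(ladder)


for bandwidth in (300, 3500, 10000):
    print(bandwidth, "->", pick_bitrate(bandwidth))
```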

Reliability: origin failures should not stop cached playback. Metrics focus on startup time, rebuffering ratio, CDN hit rate, and segment error rate.
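Two of these metrics reduce to simple ratios. The definitions below are one common convention and are stated here as assumptions: rebuffering ratio as stalled time over total session time, and CDN hit rate as edge hits over all edge requests. The sample numbers are made up.

```python
def rebuffering_ratio(stall_seconds, playing_seconds):
    """Share of the session spent stalled rather than playing."""
    total = stall_seconds + playing_seconds
    return stall_seconds / total if total else 0.0


def cdn_hit_rate(edge_hits, edge_misses):
    """Share of segment requests served from the edge cache."""
    total = edge_hits + edge_misses
    return edge_hits / total if total else 0.0


ratio = rebuffering_ratio(stall_seconds=6, playing_seconds=294)
hit_rate = cdn_hit_rate(edge_hits=950, edge_misses=50)
print(f"rebuffering {ratio:.1%}, CDN hit rate {hit_rate:.1%}")
```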

Design Checklist

Instrument traces before optimizing unknown bottlenecks.
Run small chaos tests against real fallback assumptions.
Use canaries for risky service changes.
Pick a rate limiter algorithm based on fairness, memory, and burst behavior.
Define fail-open and fail-closed behavior per endpoint.
Make rollback faster than diagnosis.

Interview Practice

What spans would you expect in a trace for a notification send?
How would you test Redis failure safely in staging?
Compare blue-green and canary deployments.
Which rate limiter algorithm best supports short bursts?
Why does a per-server in-memory rate limiter enforce the wrong limit behind many API servers?
How would you enforce global quotas across regions?
What does fail open mean for a rate limiter, and when is it acceptable?
Which metrics would detect a bad video streaming deploy?