Intermediate interviews often ask you to move from component knowledge into an end-to-end design. Reliability is where vague designs break: what happens during deploys, partial outages, hot keys, duplicate messages, and dependency failures?
Tracing And Failure Testing
Distributed tracing gives each request a trace ID and records spans across services. A good trace for a checkout, notification, or LLM request shows gateway time, service time, cache time, database time, queue publish time, and external provider time.
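To make the idea of spans concrete, here is a minimal in-process sketch. It is not a real tracing library: the `Trace` class, the `span` helper, and the stage names in `handle_checkout` are all hypothetical, and the sleeps stand in for real work. It only illustrates how one trace ID ties together timed spans for each stage of a request.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Collects timed spans for one request (illustrative only)."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped work raises.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append(
                {"trace_id": self.trace_id, "name": name, "ms": elapsed_ms}
            )


def handle_checkout(trace):
    # Hypothetical stages; the sleeps are placeholders for real calls.
    with trace.span("gateway"):
        with trace.span("cache"):
            time.sleep(0.001)
        with trace.span("database"):
            time.sleep(0.002)
        with trace.span("provider"):
            time.sleep(0.003)


trace = Trace()
handle_checkout(trace)
for s in trace.spans:
    print(f'{s["trace_id"][:8]} {s["name"]:<10} {s["ms"]:.1f} ms')
```

Inner spans close before the enclosing `gateway` span, so the gateway entry is recorded last and covers the total time. A production system would export these spans to a tracing backend rather than keep them in a list.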
Chaos engineering deliberately injects controlled failure to validate assumptions. Start small: kill one worker, add latency to Redis, make a provider return 500s, or pause a queue consumer. The point is not drama; it is proving fallback behavior before customers discover the failure mode.
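A small fault-injection wrapper is one way to rehearse this in a test environment. The sketch below is an assumption-laden illustration, not a chaos tool: `with_faults` and `fetch_with_fallback` are hypothetical names, and the injected failure is a plain `ConnectionError`.

```python
import random


def with_faults(func, error_rate=0.0, rng=random.random):
    """Wrap func so a fraction of calls raise an injected ConnectionError."""
    def wrapper(*args, **kwargs):
        if rng() < error_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def fetch_with_fallback(primary, fallback_value):
    """The behavior under test: fall back when the dependency fails."""
    try:
        return primary()
    except ConnectionError:
        return fallback_value


# error_rate=1.0 forces every call to fail, so the fallback path must run.
always_failing = with_faults(lambda: "live", error_rate=1.0)
healthy = with_faults(lambda: "live", error_rate=0.0)

print(fetch_with_fallback(always_failing, "cached"))
print(fetch_with_fallback(healthy, "cached"))
```

Forcing the error rate to 1.0 turns "we think the fallback works" into a check that can run on every build.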
Error Budgets And Deployments
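Use the error budget from your SLO to govern release risk. The budget is the amount of unreliability the SLO permits: a 99.9% availability SLO allows 0.1% of requests to fail over the measurement window. If latency and availability are within target and budget remains, ship normally. If the budget is nearly gone, freeze risky releases and invest in reliability.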
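The arithmetic behind that decision is short. This sketch assumes a request-based availability SLO; the function names and the 10% freeze threshold are illustrative choices, not standard values.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures


def release_policy(remaining, freeze_below=0.1):
    # freeze_below is an assumed threshold; teams pick their own.
    if remaining < freeze_below:
        return "freeze risky releases"
    return "ship normally"


# 99.9% SLO over 10M requests allows about 10,000 failures.
healthy = error_budget_remaining(0.999, 10_000_000, 4_000)
nearly_spent = error_budget_remaining(0.999, 10_000_000, 9_500)
print(round(healthy, 2), release_policy(healthy))
print(round(nearly_spent, 2), release_policy(nearly_spent))
```

With 4,000 failures about 60% of the budget remains; with 9,500 failures only about 5% remains, which falls under the assumed freeze threshold.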
Blue-green deployment runs two complete environments and switches traffic from old to new. Rollback is fast because the old environment is still running, but you pay for duplicate capacity while both exist.
Canary deployment sends a small percentage of traffic to the new version, watches metrics, then ramps up. It limits the blast radius of a bad release but needs good traffic segmentation and automated rollback.
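A canary ramp can be reduced to one decision per evaluation interval: advance, hold at 100%, or roll back. The sketch below assumes error rate is the only gating metric; the step percentages, the tolerance, and the function name are hypothetical.

```python
def next_canary_step(current_percent, canary_error_rate, baseline_error_rate,
                     steps=(1, 5, 25, 50, 100), tolerance=0.005):
    """Return the next traffic percentage, or 0 to signal rollback."""
    # Roll back if the canary is meaningfully worse than the baseline.
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0
    # Otherwise advance to the first step above the current percentage.
    for step in steps:
        if step > current_percent:
            return step
    return 100  # already fully rolled out


print(next_canary_step(1, 0.002, 0.001))    # healthy canary: ramp up
print(next_canary_step(5, 0.020, 0.001))    # error spike: roll back
```

Real rollouts usually gate on several signals (latency, saturation, business metrics) and compare against a baseline cohort of the same size, but the control loop has this shape.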
Full Walkthrough: Rate Limiter
Requirements: enforce per-user, per-tenant, and per-IP limits for an API. Support free and paid tiers. Return clear retry information. Handle 100,000 requests per second globally.
Capacity: every request performs at least one limiter check. At 100,000 RPS, the limiter must be low latency and horizontally scalable. A disk-backed database row update per request adds too much latency and write contention at this rate.
Algorithms:
| Algorithm | Use | Limitation |
|---|---|---|
| Fixed window | Simple counters | Boundary bursts |
| Sliding log | Exact request history | High memory |
| Sliding window counter | Good approximation | Slight inaccuracy |
| Token bucket | Average limit plus bursts | Needs careful refill math |
| Leaky bucket | Smooth outbound rate | Queues or rejects bursts |
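
As one example from the table, the sliding window counter estimates the trailing window from two fixed-window counters by weighting the previous window by how much of it still overlaps. This is a minimal sketch with hypothetical names; it assumes windows are aligned to multiples of `window_seconds` and that the caller already holds both counts.

```python
def sliding_window_allowed(now, window_seconds, limit, prev_count, curr_count):
    """Estimate requests in the trailing window from two fixed-window counters.

    Returns (allowed, estimated_count).
    """
    elapsed_in_window = now % window_seconds
    # Share of the previous fixed window that the trailing window still covers.
    prev_weight = (window_seconds - elapsed_in_window) / window_seconds
    estimated = prev_count * prev_weight + curr_count
    return estimated < limit, estimated


# 15 s into a 60 s window: 75% of the previous window still counts.
print(sliding_window_allowed(75, 60, 100, prev_count=80, curr_count=30))
print(sliding_window_allowed(75, 60, 100, prev_count=80, curr_count=50))
```

The "slight inaccuracy" in the table comes from assuming requests in the previous window were evenly spread, which is why this is an approximation rather than an exact count.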
Architecture: an L7 API gateway extracts principal, route, and tier. A rate limiter service uses Redis Cluster for counters and Lua scripts for atomic check-and-update. Configuration lives in a database and is cached locally. Decisions are logged asynchronously.
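Redis key shape (braces mark placeholders to substitute, not literal characters; literal braces in a Redis Cluster key act as a hash tag that controls slot placement):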
rl:{tenant_id}:{route}:{window_start}
rl:{ip}:{window_start}
For a token bucket, store the current token count and the last refill timestamp. The Lua script computes the refill, checks availability, decrements tokens, sets a TTL so idle buckets expire, and returns whether the request is allowed plus a retry delay.
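The refill math is easier to review outside Lua. The following sketch shows the same steps in plain Python against an in-memory bucket; it is not the Redis script itself, and `Bucket`, `take_token`, and the parameter names are hypothetical. In Redis, the two fields would live under one key and the whole function would run atomically inside the script.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float
    last_refill: float  # seconds since some fixed epoch


def take_token(bucket, now, capacity, refill_per_second, cost=1.0):
    """Refill, then try to spend `cost` tokens.

    Returns (allowed, retry_after_seconds).
    """
    # Guard against clock skew: never refill a negative amount.
    elapsed = max(0.0, now - bucket.last_refill)
    bucket.tokens = min(capacity, bucket.tokens + elapsed * refill_per_second)
    bucket.last_refill = now
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return True, 0.0
    # Time until enough tokens accumulate to cover the cost.
    retry_after = (cost - bucket.tokens) / refill_per_second
    return False, retry_after


bucket = Bucket(tokens=1.0, last_refill=0.0)
first = take_token(bucket, now=0.0, capacity=10, refill_per_second=2.0)
second = take_token(bucket, now=0.0, capacity=10, refill_per_second=2.0)
third = take_token(bucket, now=0.5, capacity=10, refill_per_second=2.0)
print(first, second, third)
```

The second call is rejected with a retry delay of 0.5 seconds, which is what a `Retry-After` style response would be derived from; half a second later the third call succeeds.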
Global scale: route a tenant consistently to a home region when strict global limits matter. For softer abuse limits, use regional limits plus asynchronous aggregation.
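Consistent routing only requires that every gateway computes the same home region for a tenant. A minimal sketch, assuming a static region list and a stable hash; the region names and `home_region` are hypothetical.

```python
import hashlib

REGIONS = ["us-east", "eu-west", "ap-south"]  # hypothetical region names


def home_region(tenant_id, regions=REGIONS):
    """Map a tenant to one region deterministically across processes."""
    # hashlib is stable across runs, unlike Python's built-in hash() for str.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]


print(home_region("tenant-42"))
```

Plain modulo hashing remaps many tenants whenever the region list changes, so a real deployment might prefer an explicit tenant-to-region table or consistent hashing.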
Failure behavior: if Redis is slow, use a local emergency limiter with small in-memory quotas for a few seconds. For expensive model-generation routes, fail closed or degrade to lower quotas. For low-cost metadata reads, fail open with alerting.
Observability: track allow rate, block rate, Redis latency, script errors, hot keys, top blocked tenants, and false-positive support tickets.
Mini Walkthrough: Video Streaming
Requirements: start playback quickly, avoid buffering, support multiple bitrates, and keep origin traffic low.
Architecture: videos are transcoded into adaptive bitrate segments, stored in object storage, and distributed through CDN edges. Clients request manifests and switch bitrates based on bandwidth. Popular content is pre-positioned near users; rare content is pulled on demand.
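Client-side bitrate switching can be sketched as picking the highest rendition that fits within measured bandwidth. This is a simplified throughput-based rule, not any specific player's algorithm; the bitrate ladder, the 0.8 safety factor, and `pick_bitrate` are assumptions.

```python
def pick_bitrate(ladder_kbps, measured_kbps, safety=0.8):
    """Highest rendition within a safety fraction of measured bandwidth.

    Falls back to the lowest rendition when nothing fits.
    """
    budget = measured_kbps * safety
    fitting = [b for b in sorted(ladder_kbps) if b <= budget]
    return fitting[-1] if fitting else min(ladder_kbps)


LADDER = [400, 1200, 2500, 5000]  # hypothetical renditions in kbps

print(pick_bitrate(LADDER, measured_kbps=3500))
print(pick_bitrate(LADDER, measured_kbps=300))
```

The safety margin leaves headroom for bandwidth variance, trading some picture quality for fewer rebuffering events. Real players typically also factor in buffer level.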
Reliability: origin failures should not stop cached playback. Metrics focus on startup time, rebuffering ratio, CDN hit rate, and segment error rate.
Design Checklist
- Instrument traces before optimizing unknown bottlenecks.
- Run small chaos tests against real fallback assumptions.
- Use canaries for risky service changes.
- Pick a rate limiter algorithm based on fairness, memory, and burst behavior.
- Define fail-open and fail-closed behavior per endpoint.
- Make rollback faster than diagnosis.
Interview Practice
- What spans would you expect in a trace for a notification send?
- How would you test Redis failure safely in staging?
- Compare blue-green and canary deployments.
- Which rate limiter algorithm best supports short bursts?
- Why is a purely in-memory, per-server rate limiter incorrect behind many API servers?
- How would you enforce global quotas across regions?
- What does fail open mean for a rate limiter, and when is it acceptable?
- Which metrics would detect a bad video streaming deploy?