Scaling to 100K Requests
A deep dive into how I architected and optimized a production system to handle 100,000 concurrent requests with sub-100ms latency and zero downtime during peak traffic.
When Traffic Spikes Break Systems
The system was originally designed to handle around 5,000 concurrent users. But as the business grew, traffic patterns changed dramatically. Flash sales and marketing campaigns would suddenly spike traffic to 50-100x normal levels.
The existing architecture struggled with these bursts — response times would climb to 10+ seconds, databases would max out connections, and eventually services would start failing. We needed a fundamental rearchitecture to handle scale gracefully.
Multi-Layer Scaling Strategy
The solution required changes at every layer of the stack. We implemented a multi-tier caching strategy, connection pooling, async processing, and horizontal auto-scaling to create a truly elastic system.
High-Level Architecture
Key Optimizations
- Multi-Tier Caching
- Connection Pooling
- Async Request Processing
- Horizontal Auto-Scaling
- Database Read Replicas
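The caching layer is the piece shown in code here: a response-caching decorator backed by Redis that sits in front of the API handlers.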
```python
from functools import wraps
import hashlib
import json

from redis.asyncio import Redis  # async client so cache calls can be awaited

# `settings` is the application's config module, imported elsewhere
redis_client = Redis.from_url(settings.REDIS_URL)

def cache_response(ttl: int = 300, key_prefix: str = "api"):
    """Multi-tier caching decorator with automatic invalidation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Generate a deterministic cache key from the function name and kwargs
            arg_hash = hashlib.md5(
                json.dumps(kwargs, sort_keys=True).encode()
            ).hexdigest()
            cache_key = f"{key_prefix}:{func.__name__}:{arg_hash}"

            # Check cache first
            cached = await redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Cache miss: execute the function and store the result with a TTL
            result = await func(*args, **kwargs)
            await redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator
```
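The connection-pooling side isn't shown in the original snippet, so here is a minimal sketch of the shape it takes, assuming asyncpg in front of PgBouncer; the pool sizes and query are illustrative, not the production values.

```python
import asyncpg

# A single shared pool caps how many Postgres connections each service
# instance can open, instead of opening one connection per request.
# Sizes are illustrative; tune them against PgBouncer/Postgres limits.
async def create_db_pool(dsn: str) -> asyncpg.Pool:
    return await asyncpg.create_pool(
        dsn,
        min_size=5,          # keep a few warm connections ready
        max_size=20,         # hard cap per service instance
        command_timeout=5,   # fail fast instead of queueing forever
    )

async def get_user(pool: asyncpg.Pool, user_id: int):
    # Hold a connection only for the duration of the query
    async with pool.acquire() as conn:
        return await conn.fetchrow(
            "SELECT id, name, email FROM users WHERE id = $1", user_id
        )
```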
Performance Improvements
After implementing these optimizations, the system handled Black Friday traffic with zero downtime — a 20x increase from previous peaks. Response times stayed consistent even under extreme load.
| Metric | Before | After | Improvement |
|---|---|---|---|
| P99 Latency | 15,000 ms | 150 ms | 100x faster |
| Database Connections | 500 (maxed) | 100 (pooled) | 80% reduction |
| Cache Hit Rate | 0% | 92% | New capability |
| Scale-out Time | Manual (30+ min) | Auto (30 sec) | 60x faster |
| Infrastructure Cost | $15K/month | $8K/month | 47% savings |
Lessons Learned
- Cache Everything (Strategically) — The biggest performance gains came from intelligent caching. Not just static assets, but API responses, database queries, and computed values.
- Connection Pooling is Non-Negotiable — At scale, database connections become the bottleneck before CPU or memory. PgBouncer paid for itself many times over.
- Design for Failure — Circuit breakers, retries with backoff, and graceful degradation kept the system stable even when individual components failed (a minimal retry sketch follows this list).
- Measure Before Optimizing — APM tools and distributed tracing showed exactly where time was spent. Many assumptions about bottlenecks were wrong.
- Scale Horizontally First — Throwing more hardware at the problem is faster than micro-optimizing code. Optimize only after horizontal scaling saturates.
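To make the "design for failure" point concrete, here is a minimal sketch of retries with exponential backoff and jitter. The function name, retry counts, and delays are illustrative assumptions, not the production implementation.

```python
import asyncio
import random

async def call_with_retries(func, *args, retries: int = 3, base_delay: float = 0.1, **kwargs):
    """Retry an async call with exponential backoff and jitter.

    Illustrative only: real code should retry on specific, retryable
    errors (timeouts, 5xx responses) rather than on every exception.
    """
    for attempt in range(retries + 1):
        try:
            return await func(*args, **kwargs)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the error
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) plus random jitter
            # so retries from many clients don't arrive in synchronized waves
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)
```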
Need help scaling your system?
I help teams architect high-performance systems that handle traffic spikes gracefully.
Let's Talk