
Scaling to 100K Requests

⚡ Performance 📈 Scaling 💾 Caching 🔄 Load Balancing

A deep dive into how I architected and optimized a production system to handle 100,000 concurrent requests with sub-100ms latency and zero downtime during peak traffic.

When Traffic Spikes Break Systems

The system was originally designed to handle around 5,000 concurrent users. But as the business grew, traffic patterns changed dramatically. Flash sales and marketing campaigns would suddenly spike traffic to 50-100x normal levels.

The existing architecture struggled with these bursts — response times would climb to 10+ seconds, databases would max out connections, and eventually services would start failing. We needed a fundamental rearchitecture to handle scale gracefully.

Before the rearchitecture, a traffic spike looked like this:

  • Avg Response Time: 10+ seconds during peak traffic
  • Error Rate: 35% of requests returned 5xx errors under load
  • Max Concurrent: ~5,000 users before the system hit its limit
  • Recovery Time after failures: measured in minutes, not seconds

Multi-Layer Scaling Strategy

The solution required changes at every layer of the stack. We implemented a multi-tier caching strategy, connection pooling, async processing, and horizontal auto-scaling to create a truly elastic system.

High-Level Architecture

  🌐 CDN (CloudFlare)
  ⚖️ Load Balancer (Nginx)
  🚀 API Gateway (rate limiting)
  ⚙️ App Cluster (auto-scaling)
  💾 Redis Cache (cluster mode)
  🗄️ PostgreSQL (read replicas)

Stack: FastAPI · Redis Cluster · PostgreSQL · Kubernetes · GCP Cloud CDN

Key Optimizations

Step 01: Multi-Tier Caching

Implemented a three-layer caching strategy: CDN edge caching for static assets, application-level caching with Redis for API responses, and query-level caching for expensive database operations. A cache-aside pattern with TTL-based invalidation reduced database load by 80% (the cache_middleware.py snippet later in this section shows the application layer).
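
For the CDN layer, the application marks which responses are safe to cache at the edge. A minimal sketch, assuming FastAPI (the route and header values are illustrative, not the production config):

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/catalog/featured")
async def featured_products(response: Response):
    # Tell the CDN (and browsers) this response may be cached at the edge for
    # 5 minutes, and served stale while it revalidates in the background.
    response.headers["Cache-Control"] = "public, max-age=300, stale-while-revalidate=60"
    return {"products": ["..."]}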

Step 02: Connection Pooling

Replaced individual database connections with PgBouncer connection pooling in transaction mode. This allowed 1,000+ application instances to share 100 persistent database connections, eliminating connection storms during traffic spikes.
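
On the application side, each instance keeps only a tiny local pool and points at PgBouncer instead of Postgres. A sketch assuming SQLAlchemy with asyncpg (host names, credentials, and pool sizes are illustrative):

from sqlalchemy.ext.asyncio import create_async_engine

# Connect to PgBouncer (conventionally port 6432), not Postgres directly.
# In transaction pooling mode, server-side prepared statements can't survive
# across transactions, so asyncpg's statement cache is disabled.
engine = create_async_engine(
    "postgresql+asyncpg://app:secret@pgbouncer:6432/shop",
    pool_size=5,           # small per-instance pool; PgBouncer does the real pooling
    max_overflow=5,
    pool_pre_ping=True,    # drop connections PgBouncer has already recycled
    connect_args={"statement_cache_size": 0},
)

PgBouncer then multiplexes these client connections onto the shared pool of ~100 server connections.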

Step 03: Async Request Processing

Migrated heavy operations (email notifications, report generation, third-party API calls) to background tasks using Celery with a Redis broker. This freed up web workers to handle more incoming requests while maintaining eventual consistency.
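
A minimal sketch of moving one of those operations off the request path, assuming Celery with Redis as both broker and result backend (the task body, mailer helper, and URLs are illustrative):

from celery import Celery

celery_app = Celery(
    "worker",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/1",
)

@celery_app.task(bind=True, max_retries=3, default_retry_delay=30)
def send_order_confirmation(self, order_id: int) -> None:
    """Deliver the confirmation email outside the request/response cycle."""
    try:
        deliver_email(order_id)  # hypothetical mailer helper
    except ConnectionError as exc:
        # Transient failure: requeue with a delay instead of surfacing an error
        raise self.retry(exc=exc)

The web handler just enqueues the work (send_order_confirmation.delay(order_id)) and returns immediately.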

Step 04: Horizontal Auto-Scaling

Configured the Kubernetes Horizontal Pod Autoscaler (HPA) to scale on custom metrics: requests per second, queue depth, and P95 latency. Pods spin up in under 30 seconds to absorb traffic spikes before they impact performance.

Step 05: Database Read Replicas

Set up PostgreSQL read replicas with automatic failover. Read-heavy queries (product listings, search, reports) are routed to the replicas while writes go to the primary, distributing the database load across multiple instances.
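
A simplified sketch of the read/write split at the session layer, assuming SQLAlchemy (endpoints and the helper are illustrative; the real setup routed through DNS endpoints with automatic failover):

from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

# Writes always go to the primary; reads fan out to the replica endpoint
# (a load-balanced DNS name in front of the replica pool).
primary = create_async_engine("postgresql+asyncpg://app:secret@db-primary:5432/shop")
replica = create_async_engine("postgresql+asyncpg://app:secret@db-replicas:5432/shop")

PrimarySession = async_sessionmaker(primary, expire_on_commit=False)
ReplicaSession = async_sessionmaker(replica, expire_on_commit=False)

def get_session(read_only: bool = False):
    """Route read-only work (listings, search, reports) to a replica."""
    return ReplicaSession() if read_only else PrimarySession()

For reference, here is the application-level caching decorator from Step 01: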
cache_middleware.py
from functools import wraps
import hashlib
import json

from redis.asyncio import Redis  # async client so cache I/O doesn't block the event loop

# `settings` comes from the application's configuration module (omitted here)
redis_client = Redis.from_url(settings.REDIS_URL)

def cache_response(ttl: int = 300, key_prefix: str = "api"):
    """Cache-aside decorator with TTL-based invalidation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build a deterministic cache key from the function name and its kwargs
            kwargs_hash = hashlib.md5(
                json.dumps(kwargs, sort_keys=True, default=str).encode()
            ).hexdigest()
            cache_key = f"{key_prefix}:{func.__name__}:{kwargs_hash}"

            # Serve from cache on a hit
            cached = await redis_client.get(cache_key)
            if cached is not None:
                return json.loads(cached)

            # Miss: execute the function and cache the result for `ttl` seconds
            result = await func(*args, **kwargs)
            await redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator
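
Applied to a read-heavy endpoint, it looks roughly like this (the route, query helper, and TTL are illustrative):

from fastapi import FastAPI

app = FastAPI()

@app.get("/products")
@cache_response(ttl=60, key_prefix="products")
async def list_products(category: str = "all", page: int = 1):
    # Expensive listing query, cached per (category, page) for 60 seconds
    return await fetch_products(category=category, page=page)  # hypothetical query helper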

Performance Improvements

After implementing these optimizations, the system handled Black Friday traffic with zero downtime — a 20x increase from previous peaks. Response times stayed consistent even under extreme load.

  • Avg Response Time: under 100 ms (↓ 99% improvement)
  • Error Rate: ↓ from 35% under load
  • Concurrent Users: 100K (↑ 20x capacity)
  • Downtime: zero, with no incidents during peak
Metric                 | Before           | After         | Improvement
P99 Latency            | 15,000 ms        | 150 ms        | 100x faster
Database Connections   | 500 (maxed)      | 100 (pooled)  | 80% reduction
Cache Hit Rate         | 0%               | 92%           | New capability
Scale-out Time         | Manual (30+ min) | Auto (30 sec) | 60x faster
Infrastructure Cost    | $15K/month       | $8K/month     | 47% savings

Lessons Learned

  • Cache Everything (Strategically) — The biggest performance gains came from intelligent caching. Not just static assets, but API responses, database queries, and computed values.
  • Connection Pooling is Non-Negotiable — At scale, database connections become the bottleneck before CPU or memory. PgBouncer paid for itself many times over.
  • Design for Failure — Circuit breakers, retries with backoff, and graceful degradation kept the system stable even when individual components failed (a small backoff sketch follows this list).
  • Measure Before Optimizing — APM tools and distributed tracing showed exactly where time was spent. Many assumptions about bottlenecks were wrong.
  • Scale Horizontally First — Throwing more hardware at the problem is faster than micro-optimizing code. Optimize only after horizontal scaling saturates.
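
As a minimal illustration of the retries-with-backoff idea mentioned above (the exception types and limits are illustrative):

import asyncio
import random

async def call_with_backoff(make_call, attempts: int = 5, base_delay: float = 0.2):
    """Retry a flaky downstream call with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries; let the caller degrade gracefully
            # Exponential backoff plus jitter so retries don't synchronize
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))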

Need help scaling your system?

I help teams architect high-performance systems that handle traffic spikes gracefully.

Let's Talk