
Scaling to 100K Requests

⚡ Performance 📈 Scaling 💾 Caching 🔄 Load Balancing

A deep dive into how I architected and optimized a production system to handle 100,000 concurrent requests with sub-100ms latency and zero downtime during peak traffic.

When Traffic Spikes Break Systems

The system was originally designed to handle around 5,000 concurrent users. But as the business grew, traffic patterns changed dramatically. Flash sales and marketing campaigns would suddenly spike traffic to 50-100x normal levels.

The existing architecture struggled with these bursts — response times would climb to 10+ seconds, databases would max out connections, and eventually services would start failing. We needed a fundamental rearchitecture to handle scale gracefully.

Before the rearchitecture, a traffic spike looked like this:

  • Avg Response Time: 10+ seconds during peak traffic
  • Error Rate: 35% of requests returned 5xx errors under load
  • Max Concurrent: ~5,000 users before the system hit its limit
  • Recovery Time after failures: measured in minutes, not seconds

Multi-Layer Scaling Strategy

The solution required changes at every layer of the stack. We implemented a multi-tier caching strategy, connection pooling, async processing, and horizontal auto-scaling to create a truly elastic system.

High-Level Architecture

  🌐 CDN (CloudFlare)
  ⚖️ Load Balancer (Nginx)
  🚀 API Gateway (rate limiting)
  ⚙️ App Cluster (auto-scaling)
  💾 Redis Cache (cluster mode)
  🗄️ PostgreSQL (read replicas)

Stack: FastAPI · Redis Cluster · PostgreSQL · Kubernetes · GCP Cloud CDN

Key Optimizations

Step 01: Multi-Tier Caching

Implemented a three-layer caching strategy: CDN edge caching for static assets, application-level caching with Redis for API responses, and query-level caching for expensive database operations. A cache-aside pattern with TTL-based invalidation reduced database load by 80% (the cache_middleware.py snippet later in this section shows the application layer).
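
For the CDN layer, the application marks which responses are safe to cache at the edge. A minimal sketch, assuming FastAPI (the route and header values are illustrative, not the production config):

from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/catalog/featured")
async def featured_products(response: Response):
    # Tell the CDN (and browsers) this response may be cached at the edge for
    # 5 minutes, and served stale while it revalidates in the background.
    response.headers["Cache-Control"] = "public, max-age=300, stale-while-revalidate=60"
    return {"products": ["..."]}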

Step 02: Connection Pooling

Replaced individual database connections with PgBouncer connection pooling in transaction mode. This allowed 1,000+ application instances to share 100 persistent database connections, eliminating connection storms during traffic spikes.
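
On the application side, each instance keeps only a tiny local pool and points at PgBouncer instead of Postgres. A sketch assuming SQLAlchemy with asyncpg (host names, credentials, and pool sizes are illustrative):

from sqlalchemy.ext.asyncio import create_async_engine

# Connect to PgBouncer (conventionally port 6432), not Postgres directly.
# In transaction pooling mode, server-side prepared statements can't survive
# across transactions, so asyncpg's statement cache is disabled.
engine = create_async_engine(
    "postgresql+asyncpg://app:secret@pgbouncer:6432/shop",
    pool_size=5,           # small per-instance pool; PgBouncer does the real pooling
    max_overflow=5,
    pool_pre_ping=True,    # drop connections PgBouncer has already recycled
    connect_args={"statement_cache_size": 0},
)

PgBouncer then multiplexes these client connections onto the shared pool of ~100 server connections.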

Step 03: Async Request Processing

Migrated heavy operations (email notifications, report generation, third-party API calls) to background tasks using Celery with a Redis broker. This freed up web workers to handle more incoming requests while maintaining eventual consistency.
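
A minimal sketch of moving one of those operations off the request path, assuming Celery with Redis as both broker and result backend (the task body, mailer helper, and URLs are illustrative):

from celery import Celery

celery_app = Celery(
    "worker",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/1",
)

@celery_app.task(bind=True, max_retries=3, default_retry_delay=30)
def send_order_confirmation(self, order_id: int) -> None:
    """Deliver the confirmation email outside the request/response cycle."""
    try:
        deliver_email(order_id)  # hypothetical mailer helper
    except ConnectionError as exc:
        # Transient failure: requeue with a delay instead of surfacing an error
        raise self.retry(exc=exc)

The web handler just enqueues the work (send_order_confirmation.delay(order_id)) and returns immediately.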

Step 04: Horizontal Auto-Scaling

Configured the Kubernetes Horizontal Pod Autoscaler (HPA) to scale on custom metrics: requests per second, queue depth, and P95 latency. Pods spin up in under 30 seconds to absorb traffic spikes before they impact performance.

Step 05: Database Read Replicas

Set up PostgreSQL read replicas with automatic failover. Read-heavy queries (product listings, search, reports) are routed to the replicas while writes go to the primary, distributing the database load across multiple instances.
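
A simplified sketch of the read/write split at the session layer, assuming SQLAlchemy (endpoints and the helper are illustrative; the real setup routed through DNS endpoints with automatic failover):

from sqlalchemy.ext.asyncio import create_async_engine, async_sessionmaker

# Writes always go to the primary; reads fan out to the replica endpoint
# (a load-balanced DNS name in front of the replica pool).
primary = create_async_engine("postgresql+asyncpg://app:secret@db-primary:5432/shop")
replica = create_async_engine("postgresql+asyncpg://app:secret@db-replicas:5432/shop")

PrimarySession = async_sessionmaker(primary, expire_on_commit=False)
ReplicaSession = async_sessionmaker(replica, expire_on_commit=False)

def get_session(read_only: bool = False):
    """Route read-only work (listings, search, reports) to a replica."""
    return ReplicaSession() if read_only else PrimarySession()

For reference, here is the application-level caching decorator from Step 01: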
cache_middleware.py
from functools import wraps
import hashlib
import json

from redis.asyncio import Redis  # async client so cache I/O doesn't block the event loop

# `settings` comes from the application's configuration module (omitted here)
redis_client = Redis.from_url(settings.REDIS_URL)

def cache_response(ttl: int = 300, key_prefix: str = "api"):
    """Cache-aside decorator with TTL-based invalidation."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build a deterministic cache key from the function name and its kwargs
            kwargs_hash = hashlib.md5(
                json.dumps(kwargs, sort_keys=True, default=str).encode()
            ).hexdigest()
            cache_key = f"{key_prefix}:{func.__name__}:{kwargs_hash}"

            # Serve from cache on a hit
            cached = await redis_client.get(cache_key)
            if cached is not None:
                return json.loads(cached)

            # Miss: execute the function and cache the result for `ttl` seconds
            result = await func(*args, **kwargs)
            await redis_client.setex(cache_key, ttl, json.dumps(result))
            return result
        return wrapper
    return decorator
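
Applied to a read-heavy endpoint, it looks roughly like this (the route, query helper, and TTL are illustrative):

from fastapi import FastAPI

app = FastAPI()

@app.get("/products")
@cache_response(ttl=60, key_prefix="products")
async def list_products(category: str = "all", page: int = 1):
    # Expensive listing query, cached per (category, page) for 60 seconds
    return await fetch_products(category=category, page=page)  # hypothetical query helper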

Performance Improvements

After implementing these optimizations, the system handled Black Friday traffic with zero downtime — a 20x increase from previous peaks. Response times stayed consistent even under extreme load.

  • Avg Response Time: under 100 ms (↓ 99% improvement)
  • Error Rate: ↓ from 35% under load
  • Concurrent Users: 100K (↑ 20x capacity)
  • Downtime: zero, with no incidents during peak
Metric                 | Before           | After         | Improvement
P99 Latency            | 15,000 ms        | 150 ms        | 100x faster
Database Connections   | 500 (maxed)      | 100 (pooled)  | 80% reduction
Cache Hit Rate         | 0%               | 92%           | New capability
Scale-out Time         | Manual (30+ min) | Auto (30 sec) | 60x faster
Infrastructure Cost    | $15K/month       | $8K/month     | 47% savings

Lessons Learned

  • Cache Everything (Strategically) — The biggest performance gains came from intelligent caching. Not just static assets, but API responses, database queries, and computed values.
  • Connection Pooling is Non-Negotiable — At scale, database connections become the bottleneck before CPU or memory. PgBouncer paid for itself many times over.
  • Design for Failure — Circuit breakers, retries with backoff, and graceful degradation kept the system stable even when individual components failed (a small backoff sketch follows this list).
  • Measure Before Optimizing — APM tools and distributed tracing showed exactly where time was spent. Many assumptions about bottlenecks were wrong.
  • Scale Horizontally First — Throwing more hardware at the problem is faster than micro-optimizing code. Optimize only after horizontal scaling saturates.
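
As a minimal illustration of the retries-with-backoff idea mentioned above (the exception types and limits are illustrative):

import asyncio
import random

async def call_with_backoff(make_call, attempts: int = 5, base_delay: float = 0.2):
    """Retry a flaky downstream call with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return await make_call()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of retries; let the caller degrade gracefully
            # Exponential backoff plus jitter so retries don't synchronize
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))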

Need help scaling your system?

I help teams architect high-performance systems that handle traffic spikes gracefully.

Let's Talk