Async Task Processing with Celery
Implementing robust background job processing for an order management system handling 5,000+ daily transactions with guaranteed delivery, automatic retries, and real-time monitoring.
When Request Handlers Do Too Much
The clients platform's order API was doing too much work synchronously. Creating an order triggered: SAP ERP sync, inventory reservation, notification emails, PDF invoice generation, analytics tracking, and audit logging — all within the HTTP request.
Response times averaged 8-15 seconds per order. Under load, timeouts increased. If any downstream service failed, the entire order creation failed. Users were abandoning orders due to perceived slowness, and operations team was getting overwhelmed with support tickets.
Synchronous Processing Problem
Async-First Design with Celery
The solution was to acknowledge orders immediately and process everything else asynchronously. The API validates the order, persists it with "pending" status, and returns within 200ms. Background workers handle all downstream processing.
Async Processing Architecture
Key Implementation Details
Dedicated Queues & Workers
Automatic Retry with Backoff
Task Chaining & Workflows
Idempotent Task Design
Real-Time Monitoring
from celery import shared_task, chain, chord
from celery.exceptions import MaxRetriesExceededError
import structlog
logger = structlog.get_logger()
@shared_task(
bind=True,
queue="sap",
max_retries=5,
autoretry_for=(SAPConnectionError, SAPTimeoutError),
retry_backoff=True,
retry_backoff_max=600,
retry_jitter=True,
)
def sync_order_to_sap(self, order_id: str):
"""Sync order to SAP ERP with automatic retry and idempotency."""
log = logger.bind(order_id=order_id, task_id=self.request.id)
try:
# Idempotency check
if sap_service.order_exists(order_id):
log.info("Order already synced to SAP, skipping")
return {"status": "already_synced"}
order = order_repository.get(order_id)
result = sap_service.create_order(order)
# Update status
order.sap_order_number = result.order_number
order.sap_synced_at = datetime.utcnow()
order_repository.save(order)
log.info("Order synced to SAP", sap_order=result.order_number)
return {"status": "synced", "sap_order": result.order_number}
except MaxRetriesExceededError:
log.error("Max retries exceeded for SAP sync")
alert_ops("SAP sync failed permanently", order_id=order_id)
raise
from celery import chain, chord, group
from tasks.order_tasks import (
validate_order, reserve_inventory, sync_order_to_sap,
send_confirmation_email, generate_invoice_pdf, update_analytics
)
def process_order_workflow(order_id: str):
"""
Order processing workflow using Celery primitives.
Flow:
1. Validate order (must complete first)
2. Reserve inventory (depends on validation)
3. Sync to SAP (depends on inventory)
4. Parallel: email + PDF + analytics (after SAP sync)
"""
workflow = chain(
# Sequential dependencies
validate_order.s(order_id),
reserve_inventory.s(),
sync_order_to_sap.s(),
# Parallel processing after sync
chord(
group(
send_confirmation_email.s(),
generate_invoice_pdf.s(),
update_analytics.s(),
),
finalize_order.s(order_id)
)
)
# Execute workflow asynchronously
return workflow.apply_async()
# API endpoint stays fast
@router.post("/orders")
async def create_order(order: OrderCreate):
# Quick validation and persist
order_record = await order_service.create(order)
# Queue background processing
process_order_workflow.delay(order_record.id)
# Return immediately (~200ms)
return {"order_id": order_record.id, "status": "processing"}
Performance & Reliability Improvements
The async architecture transformed the order experience. Users get instant feedback, system reliability improved dramatically, and the operations team can finally focus on strategic work instead of firefighting.
| Metric | Synchronous | Async (Celery) | Impact |
|---|---|---|---|
| Order API Response | 12,000 ms avg | 200 ms avg | 60x faster |
| Timeout Rate | 23% | 0% | Eliminated |
| Cart Abandonment | 15% | 3% | 80% reduction |
| Failed Transactions | 8% permanent | 0.1% after retry | 80x improvement |
| Throughput Capacity | 50 orders/min | 500+ orders/min | 10x capacity |
Lessons Learned
- Acknowledge Fast, Process Later — Users don't need to wait for background processing. Return immediately with a status, then process asynchronously. The UX improvement is dramatic.
- Design for Failure — External calls fail. Local services crash. Network partitions happen. Build retry logic, dead letter queues, and alerting into every task.
- Monitor Queue Health — Queue depth, processing time, and worker utilization tell you when to scale. Set up alerts before problems become outages.
- Tune for Task Characteristics — CPU-bound tasks (PDF generation) need different concurrency than I/O-bound tasks (API calls). One size does not fit all.
- Idempotency Is Non-Negotiable — Tasks will be retried. Workers will crash mid-execution. Design every task to be safely re-executable.
Need to scale your processing?
I help teams design robust async architectures that handle any workload.
Let's Talk