
Event-Driven Architecture

📨 Kafka 🔄 Event Sourcing 📊 CQRS 🔗 Microservices

Building resilient, loosely-coupled systems with Apache Kafka and event sourcing, enabling real-time data processing across 15+ microservices with guaranteed delivery.

Breaking Free from Synchronous Chains

Our clinical trials platform had grown into a tangle of synchronous HTTP calls between services. When the patient enrollment service called the notification service, which called the analytics service, which called the audit service... a failure anywhere broke the entire chain.

Peak traffic brought cascading failures that took down multiple services at once. Adding a new feature meant touching several services in lockstep. Data inconsistencies arose when some services succeeded while others failed mid-transaction.

Synchronous Coupling Problem

[Diagram: Enrollment → Notification → Analytics → Audit, chained over HTTP; one timed-out service leaves the others blocked, down, or waiting. Cascade failure: one service down = all services fail.]

[Metrics panel: average recovery time after failures · services affected per incident · transaction failure rate from data inconsistency · feature lead time for cross-service changes]

Event-Driven Decoupling

We redesigned the system around events as first-class citizens. Services publish domain events to Kafka topics. Other services subscribe to events they care about. Each service maintains its own data store optimized for its query patterns (CQRS).

Event-Driven Architecture with Kafka

[Diagram: Enrollment, Patient App, and Site Portal producers publish to Apache Kafka topics patient.enrolled, consent.signed, and visit.completed; Notification, Analytics, Audit Log, and Reporting consumer groups build projections in Redis, ClickHouse, MongoDB, and PostgreSQL. ✓ Services are decoupled · ✓ Independent deployment · ✓ Failure isolation]

Tech stack: Apache Kafka · Spring Cloud Stream · Kubernetes · Confluent Schema Registry (Avro) · Debezium CDC
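
As a concrete sketch of the wiring, the topics above can be declared as beans so Spring's KafkaAdmin creates them on startup; the partition and replica counts here are illustrative assumptions, not our production values.

TopicConfig.java (sketch)
@Configuration
public class TopicConfig {

    // KafkaAdmin picks up NewTopic beans and creates missing topics at startup
    @Bean
    public NewTopic patientEnrolled() {
        return TopicBuilder.name("patient.enrolled")
            .partitions(12)  // caps per-group consumer parallelism
            .replicas(3)     // broker-level durability
            .build();
    }

    @Bean
    public NewTopic consentSigned() {
        return TopicBuilder.name("consent.signed")
            .partitions(12)
            .replicas(3)
            .build();
    }
}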

Key Patterns

Pattern 01

Event Sourcing

Instead of storing current state, we store the sequence of events that led to that state. A patient enrollment becomes a stream of events: PatientRegistered → ConsentSigned → BaselineCompleted → VisitScheduled. The state at any point in time can be reconstructed by replaying the events.
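
A minimal sketch of that replay, using simplified stand-ins for two of the events above (the real events carry more fields): current state is just a fold over the stream, and replaying a prefix yields the state as of that point in time.

PatientStateReplay.java (sketch)
import java.util.List;

public class PatientStateReplay {

    // Simplified stand-ins for the real domain events
    interface PatientEvent { }
    record PatientRegistered(String patientId) implements PatientEvent { }
    record ConsentSigned(String patientId) implements PatientEvent { }

    // Immutable state, rebuilt from events rather than stored directly
    record PatientState(String patientId, boolean consented) {
        static PatientState empty() { return new PatientState(null, false); }

        PatientState apply(PatientEvent e) {
            if (e instanceof PatientRegistered r) return new PatientState(r.patientId(), consented);
            if (e instanceof ConsentSigned) return new PatientState(patientId, true);
            return this;  // ignore event types this model doesn't care about
        }
    }

    // Fold the ordered stream into state; replaying only a prefix of the
    // stream yields the state as of that point in time
    static PatientState rehydrate(List<PatientEvent> stream) {
        PatientState state = PatientState.empty();
        for (PatientEvent e : stream) {
            state = state.apply(e);
        }
        return state;
    }
}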
Pattern 02

CQRS (Command Query Responsibility Segregation)

Commands (writes) and Queries (reads) use separate models optimized for their purpose. The command side validates and publishes events. Consumer services build read-optimized projections from events. Analytics uses ClickHouse, notifications use Redis, audit uses append-only MongoDB.
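
A sketch of one read-side projection (the table, column names, and wiring are assumptions): the reporting consumer folds enrollment events into a query-optimized PostgreSQL view, and the upsert keeps it idempotent under redelivery.

EnrollmentProjection.java (sketch)
@Component
@RequiredArgsConstructor
public class EnrollmentProjection {

    private final JdbcTemplate jdbc;  // this service's own read store (PostgreSQL)

    @KafkaListener(topics = "patient.enrolled", groupId = "reporting-service")
    public void on(PatientEnrolledEvent event) {
        // Upsert keyed by patient ID, so a redelivered event is harmless
        jdbc.update("""
            INSERT INTO enrollment_view (patient_id, study_id, site_id, enrolled_at)
            VALUES (?, ?, ?, ?)
            ON CONFLICT (patient_id) DO UPDATE
               SET study_id = EXCLUDED.study_id,
                   site_id = EXCLUDED.site_id,
                   enrolled_at = EXCLUDED.enrolled_at
            """,
            event.getPatientId(), event.getStudyId(),
            event.getSiteId(), Timestamp.from(event.getEnrolledAt()));
    }
}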
Pattern 03

Saga Pattern for Distributed Transactions

Long-running transactions span multiple services via choreographed sagas. Each step publishes an event that triggers the next step. Compensation events handle rollbacks. The entire enrollment workflow completes eventually even if individual services are temporarily unavailable.
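
A sketch of one choreography step; the consent service and the consent.requested and enrollment.compensate topics are illustrative names, not the real workflow. Each step consumes the previous event and publishes either the next event or a compensation event.

ConsentSagaStep.java (sketch)
@Slf4j
@Component
@RequiredArgsConstructor
public class ConsentSagaStep {

    private final KafkaTemplate<String, Object> kafka;
    private final ConsentService consentService;  // hypothetical local service

    @KafkaListener(topics = "patient.enrolled", groupId = "consent-service")
    public void onPatientEnrolled(PatientEnrolledEvent event) {
        try {
            // Local step of the long-running enrollment workflow
            consentService.prepareConsentForms(event.getPatientId());

            // Success: publish the event that triggers the next saga step
            kafka.send("consent.requested", event.getPatientId(), event);
        } catch (Exception e) {
            log.error("Consent step failed for {}", event.getPatientId(), e);
            // Failure: publish a compensation event so earlier steps
            // undo their local changes
            kafka.send("enrollment.compensate", event.getPatientId(), event);
        }
    }
}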
Pattern 04

Schema Evolution with Avro

All events use Avro schemas registered in Confluent Schema Registry. Forward and backward compatibility rules prevent breaking changes. Consumers can process events from older producers without code changes, enabling independent deployment cycles.
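
For example, a v2 schema can add a field with a default value; under the registry's compatibility check, v2 consumers reading v1 events get the default, and v1 consumers simply ignore the new field. The field names here are illustrative, not the production schema.

PatientEnrolledEvent.avsc (v2, sketch)
{
  "type": "record",
  "name": "PatientEnrolledEvent",
  "namespace": "com.example.trials.events",
  "fields": [
    { "name": "patientId", "type": "string" },
    { "name": "studyId", "type": "string" },
    { "name": "siteId", "type": "string" },
    { "name": "enrolledAt", "type": { "type": "long", "logicalType": "timestamp-millis" } },
    { "name": "enrolledBy", "type": "string" },
    { "name": "referralSource", "type": "string", "default": "unknown",
      "doc": "Added in v2; the default keeps the change backward compatible" }
  ]
}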
Pattern 05

Dead Letter Queues & Retry

Failed events route to dead letter topics instead of blocking the consumer. Automatic retry with exponential backoff handles transient failures. Monitoring alerts on DLQ growth. Events are never lost, only delayed.
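
In Spring Kafka this behavior maps onto a DefaultErrorHandler with a DeadLetterPublishingRecoverer; a sketch, with retry timings as assumptions rather than our production policy.

KafkaErrorHandlingConfig.java (sketch)
@Configuration
public class KafkaErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<String, Object> template) {
        // After retries are exhausted, publish the failed record to
        // the "<topic>.DLT" dead letter topic
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template);

        // Exponential backoff for transient failures: 1s, 2s, 4s...
        ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
        backOff.setMaxElapsedTime(10_000L);

        DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
        // Skip retries for errors that can never succeed
        handler.addNotRetryableExceptions(NonRetryableException.class);
        return handler;
    }
}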
PatientEventPublisher.java
@Slf4j
@Service
@RequiredArgsConstructor
public class PatientEventPublisher {

    private final KafkaTemplate<String, PatientEnrolledEvent> kafkaTemplate;
    private final EventStore eventStore;

    @Transactional
    public void enrollPatient(EnrollPatientCommand command) {
        // Create the domain event
        PatientEnrolledEvent event = PatientEnrolledEvent.builder()
            .patientId(command.getPatientId())
            .studyId(command.getStudyId())
            .siteId(command.getSiteId())
            .enrolledAt(Instant.now())
            .enrolledBy(command.getEnrolledBy())
            .build();

        // Append to the event log (source of truth)
        eventStore.append("patient", command.getPatientId(), event);

        // Publish to Kafka for downstream consumers. The patient ID is the
        // partition key, so all events for one patient stay ordered.
        kafkaTemplate.send("patient.enrolled", command.getPatientId(), event)
            .whenComplete((result, ex) -> {
                if (ex == null) {
                    log.info("Event published: {}", event);
                } else {
                    log.error("Failed to publish event", ex);
                }
            });
    }
}
NotificationConsumer.java
@Slf4j
@Component
@RequiredArgsConstructor
public class NotificationConsumer {

    private final NotificationService notificationService;

    @KafkaListener(
        topics = "patient.enrolled",
        groupId = "notification-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    @Retryable(
        value = {TransientException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 1000, multiplier = 2)
    )
    public void onPatientEnrolled(
        @Payload PatientEnrolledEvent event,
        @Header(KafkaHeaders.RECEIVED_KEY) String patientId,
        Acknowledgment ack  // requires AckMode.MANUAL on the listener container
    ) {
        try {
            // Send welcome notifications
            notificationService.sendWelcomeEmail(event);
            notificationService.sendSMSConfirmation(event);

            // Commit the offset only after successful processing
            ack.acknowledge();

        } catch (TransientException e) {
            // Rethrow so @Retryable retries with exponential backoff
            throw e;
        } catch (Exception e) {
            // Non-retryable: the error handler routes the record to the DLQ
            log.error("Failed to process event: {}", event, e);
            throw new NonRetryableException(e);
        }
    }
}

Resilience Improvements

The event-driven architecture transformed system reliability. Services can now be deployed, scaled, and fail independently. Development velocity increased because teams can work on their own services without coordinating releases.

[Metrics panel: message delivery guaranteed · zero cascade failures (full isolation) · feature lead time 67% faster · peak throughput in thousands of events per second]
| Aspect | Synchronous | Event-Driven |
| --- | --- | --- |
| Coupling | Tight (compile-time) | Loose (runtime) |
| Failure scope | Cascading | Isolated |
| Scaling | Vertical (whole chain) | Horizontal (per consumer) |
| Deployability | Coordinated releases | Independent releases |
| Replay/debug | Logs only | Full event history |

Lessons Learned

  • Events Are Immutable Facts — Design events as facts about what happened, not commands for what to do. This makes them durable and replayable.
  • Schema Evolution Matters — Plan for schema changes from day one. Breaking changes cascade through all consumers and are expensive to fix.
  • Idempotency Is Non-Negotiable — Consumers must handle duplicate events gracefully. Network issues, retries, and rebalancing all cause redelivery (a dedup sketch follows this list).
  • Eventual Consistency Requires UX Changes — Users need feedback that operations are "in progress" rather than instant confirmation. Set expectations correctly.
  • Observability Becomes Critical — Distributed tracing across events is harder than tracing HTTP calls. Invest in correlation IDs and event flow visualization.
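
A minimal sketch of the dedup guard behind the idempotency point; the processed_events table, the event ID, and the wiring are assumptions (PostgreSQL syntax).

IdempotentHandler.java (sketch)
@Component
@RequiredArgsConstructor
public class IdempotentHandler {

    private final JdbcTemplate jdbc;
    private final NotificationService notificationService;

    @Transactional
    public void handle(PatientEnrolledEvent event, String eventId) {
        // Insert-first dedup: a redelivered event hits the primary key,
        // inserts zero rows, and the side effect runs exactly once
        int inserted = jdbc.update(
            "INSERT INTO processed_events (event_id) VALUES (?) ON CONFLICT DO NOTHING",
            eventId);
        if (inserted == 0) {
            return;  // duplicate delivery, already handled
        }
        notificationService.sendWelcomeEmail(event);
    }
}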

Ready to go event-driven?

I help teams design and implement event-driven architectures that scale.

Let's Talk