Event-Driven Architecture

01 — The Challenge

Breaking Free from Synchronous Chains

Our clinical trials platform had grown into a tangle of synchronous HTTP calls between services. When the patient enrollment service called the notification service, which called the analytics service, which called the audit service... a failure anywhere broke the entire chain.

Peak times saw cascading failures that took down multiple services. Adding new features meant modifying multiple services. Data inconsistencies arose when some services succeeded while others failed mid-transaction.

Synchronous Coupling Problem

0min

Avg Recovery Time

After failures

0

Services Affected

Per incident

0%

Transaction Failures

Data inconsistency

0wks

Feature Lead Time

Cross-service changes

02 — Solution Architecture

Event-Driven Decoupling

We redesigned the system around events as first-class citizens. Services publish domain events to Kafka topics. Other services subscribe to events they care about. Each service maintains its own data store optimized for its query patterns (CQRS).

Event-Driven Architecture with Kafka

Apache Kafka Spring

Spring Cloud Stream K8s

Kubernetes Schema Registry Avro Debezium CDC

03 — Implementation

Key Patterns

Pattern 01

Event Sourcing

Instead of storing current state, we store the sequence of events that led to that state. The patient enrollment becomes a stream of events: PatientRegistered → ConsentSigned → BaselineCompleted → VisitScheduled. Any point in time state can be reconstructed by replaying events.

Pattern 02

CQRS (Command Query Separation)

Commands (writes) and Queries (reads) use separate models optimized for their purpose. The command side validates and publishes events. Consumer services build read-optimized projections from events. Analytics uses ClickHouse, notifications use Redis, audit uses append-only MongoDB.

Pattern 03

Saga Pattern for Distributed Transactions

Long-running transactions span multiple services via choreographed sagas. Each step publishes an event that triggers the next step. Compensation events handle rollbacks. The entire enrollment workflow completes eventually even if individual services are temporarily unavailable.

Pattern 04

Schema Evolution with Avro

All events use Avro schemas registered in Confluent Schema Registry. Forward and backward compatibility rules prevent breaking changes. Consumers can process events from older producers without code changes, enabling independent deployment cycles.

Pattern 05

Dead Letter Queues & Retry

Failed events route to dead letter topics instead of blocking the consumer. Automatic retry with exponential backoff handles transient failures. Monitoring alerts on DLQ growth. Events are never lost, only delayed.

@Service
@RequiredArgsConstructor
public class PatientEventPublisher {
    
    private final KafkaTemplate kafkaTemplate;
    private final EventStore eventStore;
    
    @Transactional
    public void enrollPatient(EnrollPatientCommand command) {
        // Create domain event
        PatientEnrolledEvent event = PatientEnrolledEvent.builder()
            .patientId(command.getPatientId())
            .studyId(command.getStudyId())
            .siteId(command.getSiteId())
            .enrolledAt(Instant.now())
            .enrolledBy(command.getEnrolledBy())
            .build();
        
        // Store in event log (source of truth)
        eventStore.append("patient", command.getPatientId(), event);
        
        // Publish to Kafka for downstream consumers
        kafkaTemplate.send(
            "patient.enrolled",
            command.getPatientId(),  // Partition key for ordering
            event
        ).addCallback(
            result -> log.info("Event published: {}", event),
            ex -> log.error("Failed to publish event", ex)
        );
    }
}

@Component
@RequiredArgsConstructor
public class NotificationConsumer {
    
    private final NotificationService notificationService;
    
    @KafkaListener(
        topics = "patient.enrolled",
        groupId = "notification-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    @Retryable(
        value = {TransientException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 1000, multiplier = 2)
    )
    public void onPatientEnrolled(
        @Payload PatientEnrolledEvent event,
        @Header(KafkaHeaders.RECEIVED_KEY) String patientId,
        Acknowledgment ack
    ) {
        try {
            // Send welcome notification
            notificationService.sendWelcomeEmail(event);
            notificationService.sendSMSConfirmation(event);
            
            // Commit offset only after successful processing
            ack.acknowledge();
            
        } catch (TransientException e) {
            // Will be retried automatically
            throw e;
        } catch (Exception e) {
            // Non-retryable: send to DLQ
            log.error("Failed to process event: {}", event, e);
            throw new NonRetryableException(e);
        }
    }
}

04 — Results

Resilience Improvements

The event-driven architecture transformed system reliability. Services can now be deployed, scaled, and fail independently. Development velocity increased as teams can work on their services without coordinating releases.

0%

Message Delivery

Guaranteed

0

Cascade Failures

Full isolation

0wks

Feature Lead Time

↓ 67% faster

0K

Events/Second

Peak throughput

Aspect	Synchronous	Event-Driven
Coupling	Tight (compile-time)	Loose (runtime)
Failure Scope	Cascading	Isolated
Scaling	Vertical (whole chain)	Horizontal (per consumer)
Deployability	Coordinated releases	Independent releases
Replay/Debug	Logs only	Full event history

05 — Key Takeaways

Lessons Learned

Events Are Immutable Facts — Design events as facts about what happened, not commands for what to do. This makes them durable and replayable.
Schema Evolution Matters — Plan for schema changes from day one. Breaking changes cascade through all consumers and are expensive to fix.
Idempotency Is Non-Negotiable — Consumers must handle duplicate events gracefully. Network issues, retries, and rebalancing all cause redelivery.
Eventual Consistency Requires UX Changes — Users need feedback that operations are "in progress" rather than instant confirmation. Set expectations correctly.
Observability Becomes Critical — Distributed tracing across events is harder than tracing HTTP calls. Invest in correlation IDs and event flow visualization.

Ready to go event-driven?

I help teams design and implement event-driven architectures that scale.

Let's Talk