Event-Driven Architecture
Building resilient, loosely coupled systems with Apache Kafka and event sourcing, enabling real-time data processing across 15+ microservices with at-least-once delivery.
Breaking Free from Synchronous Chains
Our clinical trials platform had grown into a tangle of synchronous HTTP calls between services. The patient enrollment service called the notification service, which called the analytics service, which called the audit service, so a failure anywhere broke the entire chain.
Peak times saw cascading failures that took down multiple services. Adding new features meant modifying multiple services. Data inconsistencies arose when some services succeeded while others failed mid-transaction.
Figure: Synchronous Coupling Problem
Event-Driven Decoupling
We redesigned the system around events as first-class citizens. Services publish domain events to Kafka topics. Other services subscribe to events they care about. Each service maintains its own data store optimized for its query patterns (CQRS).
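The publish/subscribe split and the per-service read models described above can be sketched without Kafka at all. The following is a minimal illustration, not the platform's code: `InMemoryTopic` stands in for a Kafka topic, and the `PatientEnrolled` record and both read-model classes are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical event shape; the field names are assumptions for illustration.
record PatientEnrolled(String patientId, String studyId) {}

// In-memory stand-in for a Kafka topic: every subscriber sees every event.
class InMemoryTopic<E> {
    private final List<Consumer<E>> subscribers = new ArrayList<>();

    void subscribe(Consumer<E> subscriber) {
        subscribers.add(subscriber);
    }

    void publish(E event) {
        // A real broker delivers asynchronously; this loop is synchronous for brevity.
        for (Consumer<E> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }
}

// Read model owned by a hypothetical analytics service: enrollments per study.
class StudyEnrollmentCounts {
    private final Map<String, Integer> countsByStudy = new HashMap<>();

    void on(PatientEnrolled event) {
        countsByStudy.merge(event.studyId(), 1, Integer::sum);
    }

    int countFor(String studyId) {
        return countsByStudy.getOrDefault(studyId, 0);
    }
}

// Read model owned by a hypothetical notification service: patients awaiting a welcome message.
class PendingWelcomeList {
    private final List<String> patientIds = new ArrayList<>();

    void on(PatientEnrolled event) {
        patientIds.add(event.patientId());
    }

    List<String> pending() {
        return List.copyOf(patientIds);
    }
}
```

The publisher knows only the topic and the event type. Each subscriber (`counts::on`, `welcome::on`) keeps its own store shaped for its own queries, which is the read side of CQRS in miniature.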
Figure: Event-Driven Architecture with Kafka
Key Patterns
Event Sourcing
CQRS (Command Query Responsibility Segregation)
Saga Pattern for Distributed Transactions
Schema Evolution with Avro
Dead Letter Queues & Retry
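As a rough illustration of the first pattern, event sourcing treats the append-only log as the source of truth and derives current state by replaying it. The sketch below assumes a hypothetical two-event patient stream (`Enrolled`, `Withdrawn`) and an in-memory log; none of these names come from the platform itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical events for a single patient's stream.
interface PatientEvent {}
record Enrolled(String studyId) implements PatientEvent {}
record Withdrawn(String reason) implements PatientEvent {}

// Append-only log: events are added, never updated or deleted.
class InMemoryEventLog {
    private final List<PatientEvent> events = new ArrayList<>();

    void append(PatientEvent event) {
        events.add(event);
    }

    List<PatientEvent> readAll() {
        return List.copyOf(events);
    }
}

// Current state is never stored directly; it is derived by folding over the history.
record PatientState(String studyId, boolean active) {

    static PatientState empty() {
        return new PatientState(null, false);
    }

    // Applies one event and returns the next state, leaving this one unchanged.
    PatientState apply(PatientEvent event) {
        if (event instanceof Enrolled enrolled) {
            return new PatientState(enrolled.studyId(), true);
        }
        if (event instanceof Withdrawn) {
            return new PatientState(studyId, false);
        }
        return this; // Unknown event types are ignored in this sketch.
    }

    static PatientState replay(List<PatientEvent> history) {
        PatientState state = empty();
        for (PatientEvent event : history) {
            state = state.apply(event);
        }
        return state;
    }
}
```

Calling `PatientState.replay(log.readAll())` after appending an `Enrolled` and then a `Withdrawn` event yields an inactive patient who still records the study they joined. Replaying a prefix of the log gives the state at that earlier point.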
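import java.time.Instant;

import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Slf4j // Provides the `log` field used in the callbacks below
@Service
@RequiredArgsConstructor
public class PatientEventPublisher {

    private final KafkaTemplate<String, PatientEnrolledEvent> kafkaTemplate;
    private final EventStore eventStore;

    @Transactional
    public void enrollPatient(EnrollPatientCommand command) {
        // Create domain event
        PatientEnrolledEvent event = PatientEnrolledEvent.builder()
            .patientId(command.getPatientId())
            .studyId(command.getStudyId())
            .siteId(command.getSiteId())
            .enrolledAt(Instant.now())
            .enrolledBy(command.getEnrolledBy())
            .build();

        // Store in event log (source of truth)
        eventStore.append("patient", command.getPatientId(), event);

        // Publish to Kafka for downstream consumers.
        // The send is asynchronous; unless Kafka transactions are configured it is
        // not part of the @Transactional unit, and a failed publish is only logged here.
        // addCallback is the Spring Kafka 2.x API. In 3.x, send() returns a
        // CompletableFuture, so use whenComplete(...) instead.
        kafkaTemplate.send(
            "patient.enrolled",
            command.getPatientId(), // Partition key: keeps one patient's events in order
            event
        ).addCallback(
            result -> log.info("Event published: {}", event),
            ex -> log.error("Failed to publish event", ex)
        );
    }
}
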
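import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Component;

@Slf4j // Provides the `log` field used below
@Component
@RequiredArgsConstructor
public class NotificationConsumer {

    private final NotificationService notificationService;

    @KafkaListener(
        topics = "patient.enrolled",
        groupId = "notification-service",
        containerFactory = "kafkaListenerContainerFactory"
    )
    // Up to 3 attempts in total, waiting 1s and then 2s between them
    @Retryable(
        value = {TransientException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 1000, multiplier = 2)
    )
    public void onPatientEnrolled(
        @Payload PatientEnrolledEvent event,
        @Header(KafkaHeaders.RECEIVED_KEY) String patientId,
        Acknowledgment ack // Requires a manual ack mode on the container factory
    ) {
        try {
            // Send welcome notification. A retry re-runs both calls,
            // so both sends must tolerate being repeated.
            notificationService.sendWelcomeEmail(event);
            notificationService.sendSMSConfirmation(event);

            // Commit offset only after successful processing
            ack.acknowledge();
        } catch (TransientException e) {
            // Rethrow so @Retryable can retry the whole method
            throw e;
        } catch (Exception e) {
            // Non-retryable: rethrow so the container's error handler can route
            // the record to the DLQ (assumes one is configured on the factory)
            log.error("Failed to process event: {}", event, e);
            throw new NonRetryableException(e);
        }
    }
}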
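To make the retry settings in the consumer concrete, here is a framework-free sketch of the same policy: a fixed number of attempts with exponential backoff for transient failures, and a dead-letter list for everything that cannot be processed. `RetryingHandler` and `TransientFailure` are hypothetical names, and the pause is injected as a function so the delays can be observed without real sleeping. It illustrates the policy only and does not reproduce how Spring Retry or Spring Kafka implement it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

// Hypothetical marker for failures that are worth retrying.
class TransientFailure extends RuntimeException {
    TransientFailure(String message) {
        super(message);
    }
}

class RetryingHandler<E> {
    private final Consumer<E> delegate;
    private final int maxAttempts;
    private final long initialDelayMs;
    private final double multiplier;
    private final LongConsumer pause; // e.g. a Thread.sleep wrapper in production
    private final List<E> deadLetters = new ArrayList<>();

    RetryingHandler(Consumer<E> delegate, int maxAttempts, long initialDelayMs,
                    double multiplier, LongConsumer pause) {
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
        this.initialDelayMs = initialDelayMs;
        this.multiplier = multiplier;
        this.pause = pause;
    }

    void handle(E event) {
        long delayMs = initialDelayMs;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                delegate.accept(event);
                return; // Processed successfully, nothing to dead-letter.
            } catch (TransientFailure e) {
                if (attempt == maxAttempts) {
                    break; // Retries exhausted.
                }
                pause.accept(delayMs);
                delayMs = (long) (delayMs * multiplier);
            } catch (RuntimeException e) {
                break; // Non-retryable: skip the remaining attempts.
            }
        }
        deadLetters.add(event); // Parked for inspection or manual replay.
    }

    List<E> deadLetters() {
        return List.copyOf(deadLetters);
    }
}
```

With `maxAttempts = 3`, `initialDelayMs = 1000`, and `multiplier = 2.0`, a handler that keeps failing transiently is tried three times with pauses of 1000 ms and 2000 ms, and the event then lands in `deadLetters()`. A non-retryable failure goes there on the first attempt.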
Resilience Improvements
The event-driven architecture transformed system reliability. Services can now deploy, scale, and fail independently. Development velocity increased because teams can work on their own services without coordinating releases.
| Aspect | Synchronous | Event-Driven |
|---|---|---|
| Coupling | Tight (caller depends on every callee being up) | Loose (services share only event schemas) |
| Failure Scope | Cascading | Isolated |
| Scaling | Vertical (whole chain) | Horizontal (per consumer) |
| Deployability | Coordinated releases | Independent releases |
| Replay/Debug | Logs only | Full event history |
Lessons Learned
- Events Are Immutable Facts — Design events as facts about what happened, not commands for what to do. This makes them durable and replayable.
- Schema Evolution Matters — Plan for schema changes from day one. Breaking changes cascade through all consumers and are expensive to fix.
- Idempotency Is Non-Negotiable — Consumers must handle duplicate events gracefully. Network issues, retries, and rebalancing all cause redelivery.
- Eventual Consistency Requires UX Changes — Users need feedback that operations are "in progress" rather than instant confirmation. Set expectations correctly.
- Observability Becomes Critical — Distributed tracing across events is harder than tracing HTTP calls. Invest in correlation IDs and event flow visualization.
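Two of the lessons above, idempotent consumers and correlation IDs, lend themselves to a short sketch. The `Envelope` record and `IdempotentConsumer` class below are hypothetical. The sketch assumes every event carries a unique `eventId`, and it keeps processed IDs in memory, whereas a real consumer would need a durable store, ideally updated in the same transaction as the side effect.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical envelope: eventId identifies one event for deduplication,
// correlationId ties together every event caused by the same original request.
record Envelope<T>(String eventId, String correlationId, T payload) {

    // Builds a downstream event that inherits this event's correlation ID.
    <U> Envelope<U> causes(String newEventId, U newPayload) {
        return new Envelope<>(newEventId, correlationId, newPayload);
    }
}

class IdempotentConsumer<T> {
    private final Set<String> processedEventIds = new HashSet<>();
    private final Consumer<Envelope<T>> handler;

    IdempotentConsumer(Consumer<Envelope<T>> handler) {
        this.handler = handler;
    }

    // Returns true if the handler ran, false if the event was a duplicate.
    boolean accept(Envelope<T> envelope) {
        if (processedEventIds.contains(envelope.eventId())) {
            return false; // Redelivery: already handled, skip the side effect.
        }
        handler.accept(envelope);
        // Record the ID only after success so a failed attempt can be retried.
        processedEventIds.add(envelope.eventId());
        return true;
    }
}
```

Delivering the same envelope twice runs the handler once. An event created with `causes(...)` keeps the original `correlationId`, so a trace can follow one enrollment through notification, analytics, and audit events.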