SA-201b · Module 3

Failure Handling at Integration Points

Integration points are where systems break. The external API that times out. The message broker that loses a message. The database that rejects a write because of a constraint violation the documentation did not mention. Every integration point is a failure surface, and the architecture must handle failure at every surface — not as an exception, but as an expected state.

Circuit Breakers

When a downstream service is failing, stop calling it. The circuit breaker pattern monitors failure rates and, when the rate exceeds a threshold, opens the circuit: calls go to a fallback instead of the failing service. After a cooldown, the breaker lets a trial call through and closes again if it succeeds. This prevents cascading failures, where callers pile up waiting on one failing service until they fail too. The circuit breaker is the architectural equivalent of a fuse: it sacrifices the failing path to protect the system.
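
The circuit breaker described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: the class name `CircuitBreaker` and the parameters `failure_threshold`, `reset_timeout`, and `clock` are hypothetical, and the sketch counts consecutive failures rather than tracking a true failure rate over a time window.

```python
import time


class CircuitBreaker:
    """Hypothetical sketch: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_timeout = reset_timeout          # cooldown in seconds (assumed)
        self.clock = clock                          # injectable so tests need not wait
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: skip the downstream call entirely.
                return fallback()
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            # A real breaker would usually count only selected error types.
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open, or re-open after a failed trial
            return fallback()
        # Success: reset the counter and close the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_prices, lambda: cached_prices)`, where both names stand in for whatever call and fallback the system actually has.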

Retry with Backoff

Transient failures, such as network timeouts and temporary unavailability, often resolve on retry. But retrying immediately can amplify the problem by adding load to a service that is already struggling. Exponential backoff spaces retries progressively: 1 second, 2 seconds, 4 seconds, 8 seconds. Add jitter (random variance in the delay) to prevent retry storms, where every consumer retries at exactly the same time. Cap the number of retries so that a persistent failure surfaces instead of looping forever.
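
As a rough sketch of exponential backoff with jitter, the following Python uses the 1, 2, 4, 8 second schedule from the text. The function names, the `cap` parameter, and the choice of which exceptions count as transient are assumptions made for illustration. The jitter shown picks a random delay between zero and the backoff value, which is one common variant among several.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, max_retries=4, cap=30.0):
    """Un-jittered delay before each retry; the defaults give 1, 2, 4, 8 seconds."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def retry_with_backoff(func, retryable=(TimeoutError, ConnectionError),
                       sleep=time.sleep, rng=random.random, **backoff):
    """Call func, retrying on transient errors with jittered exponential backoff.

    `sleep` and `rng` are injectable so the schedule can be tested without waiting.
    """
    for delay in backoff_delays(**backoff):
        try:
            return func()
        except retryable:
            # Jitter: wait a random fraction of the backoff delay so that
            # many clients do not retry at the same instant.
            sleep(rng() * delay)
    # Final attempt after the last delay; any exception propagates to the caller.
    return func()
```

Because the number of retries is bounded, a failure that is not transient reaches the caller after the last attempt instead of being retried indefinitely.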

Dead Letter Queues

Messages that cannot be processed after the maximum number of retries need a destination, not the void. Dead letter queues capture failed messages for diagnosis and reprocessing. Every message in the dead letter queue is a data point about what went wrong. Without one, failed messages disappear and the failure is invisible.
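
The dead letter flow can be illustrated with an in-memory stand-in. This sketch assumes a hypothetical `consume` function and uses a plain `deque` in place of a real broker's dead letter queue. The record fields (`message`, `error`, `attempts`) are illustrative choices, not a standard format.

```python
from collections import deque


def consume(messages, handler, dead_letters, max_attempts=3):
    """Process each message; park it in dead_letters once attempts run out."""
    for message in messages:
        for _ in range(max_attempts):
            try:
                handler(message)
                break  # processed successfully, move to the next message
            except Exception as exc:
                last_error = exc  # remember why the latest attempt failed
        else:
            # The loop ended without a break: every attempt failed.
            # Keep the message and the error so the failure stays visible.
            dead_letters.append({
                "message": message,
                "error": repr(last_error),
                "attempts": max_attempts,
            })


dead_letters = deque()  # stand-in for a broker-managed dead letter queue
```

Keeping the error alongside the original message is what makes the queue useful for diagnosis, and the stored message can be replayed once the cause is fixed.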

Do This

Implement circuit breakers on every external service call — protect the system from cascading failures
Use exponential backoff with jitter for transient failure retry — spacing prevents amplification
Route unprocessable messages to dead letter queues for diagnosis — invisible failures are the most dangerous

Avoid This

Retry immediately and indefinitely — you will amplify the failure instead of recovering from it
Swallow errors silently — a failure that nobody knows about is a failure that nobody fixes
Assume downstream services are always available — they are not, and your architecture must handle that reality