CI-301b · Module 3

Source Reliability Engineering

Sources break. APIs change authentication requirements. Websites restructure their pages. RSS feeds go offline. Financial data providers change their schemas. Source reliability engineering treats these failures as expected events, not emergencies. Every source in the network has a failure mode catalog: what can go wrong, how to detect it, and how to recover. Collection adapters include automated health checks that detect failures within minutes and fallback mechanisms that maintain coverage while the primary source is unavailable.
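A failure mode catalog can be as simple as a small data structure kept per source. The following is a minimal sketch, not a prescribed schema: the names `FailureMode`, `CatalogEntry`, `SourceCatalog`, and `uncovered_modes` are hypothetical, chosen only to illustrate recording what can go wrong, how to detect it, and how to recover.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class FailureMode(Enum):
    """Illustrative failure modes; a real catalog may track more."""
    AUTH_EXPIRY = "authentication expiry"
    SCHEMA_CHANGE = "schema change"
    RATE_LIMITING = "rate limiting"
    DOWNTIME = "downtime"


@dataclass
class CatalogEntry:
    mode: FailureMode   # what can go wrong
    detection: str      # how a health check would notice it
    recovery: str       # what to do once it is detected


@dataclass
class SourceCatalog:
    source_name: str
    entries: List[CatalogEntry] = field(default_factory=list)

    def uncovered_modes(self) -> Set[FailureMode]:
        """Return failure modes that have no catalog entry yet."""
        covered = {entry.mode for entry in self.entries}
        return set(FailureMode) - covered
```

A helper like `uncovered_modes` makes gaps visible: any source whose result is non-empty has a failure mode with no documented detection or recovery step.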

Do This

Catalog failure modes for every source — authentication expiry, schema change, rate limiting, downtime
Build automated health checks into every collection adapter — detect failures, do not discover them
Maintain fallback sources for critical intelligence requirements — if Source A fails, Source B covers
Test recovery procedures regularly — a recovery plan that has never been tested is a hypothesis
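The health check and fallback practices above can be sketched together. This is an illustration under stated assumptions, not a reference implementation: `check_health` and `collect_with_fallback` are hypothetical names, each source is assumed to be a zero-argument callable returning a list of dict records, and a missing required field in the first record is treated as a suspected schema change.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Record = Dict[str, object]
Fetcher = Callable[[], List[Record]]  # assumed adapter interface


def check_health(fetch: Fetcher, required_fields: Sequence[str]) -> Tuple[bool, str, List[Record]]:
    """Fetch once and classify the result as healthy or failed."""
    try:
        records = fetch()
    except Exception as exc:  # downtime, auth expiry, rate limiting, ...
        return False, f"fetch failed: {exc}", []
    if not records:
        return False, "no records returned", []
    missing = [name for name in required_fields if name not in records[0]]
    if missing:
        return False, f"schema change suspected, missing fields: {missing}", []
    return True, "ok", records


def collect_with_fallback(
    sources: Sequence[Tuple[str, Fetcher]],
    required_fields: Sequence[str],
) -> Tuple[Optional[str], List[Record], List[Tuple[str, str]]]:
    """Try sources in priority order; return the first healthy result.

    Also returns the list of (source name, reason) failures so they
    can be alerted on instead of being silently skipped.
    """
    failures: List[Tuple[str, str]] = []
    for name, fetch in sources:
        healthy, reason, records = check_health(fetch, required_fields)
        if healthy:
            return name, records, failures
        failures.append((name, reason))
    return None, [], failures
```

Passing stub fetchers that raise or return malformed records is one inexpensive way to exercise the recovery path regularly, so the fallback is tested before a real outage depends on it.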

Avoid This

Treat source failures as one-off incidents — they are recurring events that need systematic handling
Rely on a single source for any critical intelligence requirement — single points of failure are pipeline risks
Wait for consumers to notice missing intelligence before investigating — monitor collection health proactively
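Proactive monitoring can start with a staleness check on the last successful collection per source. The sketch below assumes a hypothetical mapping of source names to timezone-aware timestamps; the function name `stale_sources` and the 30-minute threshold in the usage example are illustrative choices, not values from this module.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sources(
    last_success: Dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return names of sources whose last success is older than max_age."""
    current = now or datetime.now(timezone.utc)
    return sorted(
        name for name, stamp in last_success.items() if current - stamp > max_age
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_success = {
        "feed_a": now - timedelta(minutes=5),
        "feed_b": now - timedelta(hours=2),
    }
    # With a 30-minute threshold, only feed_b is reported as stale.
    print(stale_sources(last_success, timedelta(minutes=30), now=now))
```

Run on a schedule, a check like this surfaces a silent source before a consumer notices the gap.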