CI-301b · Module 3

Source Reliability Engineering

Sources break. APIs change authentication requirements. Websites restructure their pages. RSS feeds go offline. Financial data providers change their schemas. Source reliability engineering treats these failures as expected events, not emergencies. Every source in the network has a failure mode catalog: what can go wrong, how to detect it, and how to recover. Collection adapters include automated health checks that detect failures within minutes and fallback mechanisms that maintain coverage while the primary source is unavailable.
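
A failure mode catalog can be as simple as one structured record per source and failure mode. The sketch below shows one possible shape, not a prescribed format: the source name `example_feed`, the `CatalogEntry` fields, and the detection and recovery strings are all illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    AUTH_EXPIRY = "authentication expiry"
    SCHEMA_CHANGE = "schema change"
    RATE_LIMIT = "rate limiting"
    DOWNTIME = "downtime"


@dataclass(frozen=True)
class CatalogEntry:
    mode: FailureMode
    detection: str  # how a health check would recognize this failure
    recovery: str   # what the adapter or operator does next


# Hypothetical catalog for a single hypothetical source.
CATALOG: dict[str, list[CatalogEntry]] = {
    "example_feed": [
        CatalogEntry(FailureMode.AUTH_EXPIRY, "HTTP 401 or 403", "refresh credentials, retry"),
        CatalogEntry(FailureMode.SCHEMA_CHANGE, "required field missing", "pin parser, alert owner"),
        CatalogEntry(FailureMode.RATE_LIMIT, "HTTP 429", "back off, lower poll frequency"),
        CatalogEntry(FailureMode.DOWNTIME, "timeout or connection error", "switch to fallback source"),
    ],
}


def uncatalogued_modes(source: str) -> set[FailureMode]:
    """Return the failure modes that have no catalog entry for this source."""
    covered = {entry.mode for entry in CATALOG.get(source, [])}
    return set(FailureMode) - covered
```

A check like `uncatalogued_modes` could run as part of adapter tests, so that a source with gaps in its catalog is flagged before it goes into production.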

Do This

  • Catalog failure modes for every source — authentication expiry, schema change, rate limiting, downtime
  • Build automated health checks into every collection adapter — detect failures, do not discover them
  • Maintain fallback sources for critical intelligence requirements — if Source A fails, Source B covers
  • Test recovery procedures regularly — a recovery plan that has never been tested is a hypothesis
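
The health check and fallback practices above could be combined in a collection adapter roughly as follows. This is a minimal sketch under stated assumptions: each source is represented by a hypothetical zero-argument `fetch` callable that returns a list of dict records or raises, and `REQUIRED_FIELDS` stands in for whatever schema the pipeline actually expects.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence

Fetcher = Callable[[], list]  # hypothetical: returns parsed records or raises

REQUIRED_FIELDS = {"id", "timestamp"}  # assumed schema, for illustration only


def probe(fetch: Fetcher) -> tuple[Optional[list], Optional[str]]:
    """Fetch once and validate. Returns (records, None) or (None, problem)."""
    try:
        records = fetch()
    except Exception as exc:  # network errors, auth errors, parse errors
        return None, f"fetch failed: {exc}"
    if not records:
        return None, "empty response"
    missing = REQUIRED_FIELDS - set(records[0])
    if missing:
        return None, f"schema change: missing {sorted(missing)}"
    return records, None


def collect_with_fallback(sources: Sequence[tuple[str, Fetcher]]):
    """Try sources in priority order; return the first healthy result.

    Returns (source_name, records, failures), where failures maps each
    source that was tried and rejected to the reason it was rejected.
    """
    failures: dict[str, str] = {}
    for name, fetch in sources:
        records, problem = probe(fetch)
        if problem is None:
            return name, records, failures
        failures[name] = problem
    raise RuntimeError(f"all sources failed: {failures}")
```

Returning the `failures` mapping alongside the data means a successful fallback still leaves a record that the primary source needs attention, so the failure is detected by the pipeline and not discovered later by a consumer.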

Avoid This

  • Treat source failures as one-off incidents — they are recurring events that need systematic handling
  • Rely on a single source for any critical intelligence requirement — single points of failure are pipeline risks
  • Wait for consumers to notice missing intelligence before investigating — monitor collection health proactively
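
Proactive monitoring can start with a staleness check on each source's last successful collection. In the sketch below, the `last_success` mapping, the source names, and the one-hour threshold are assumptions for illustration; a real pipeline would read these timestamps from wherever its adapters record them.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_sources(
    last_success: dict[str, datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return names of sources whose last successful collection is too old."""
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, ts in last_success.items() if now - ts > max_age
    )


# Usage with hypothetical sources and a fixed clock for reproducibility.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_success = {
    "example_feed": now - timedelta(minutes=5),
    "example_api": now - timedelta(hours=3),
}
alerts = stale_sources(last_success, max_age=timedelta(hours=1), now=now)
```

Here `alerts` contains only `example_api`, which would trigger an investigation before any consumer reports missing intelligence.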