KM-301h · Module 3

Sync Failures & Staleness Propagation

3 min read

Every push integration will eventually experience a sync failure — a knowledge update that was not delivered to one or more integrated tools because the tool was unavailable, the network was interrupted, or the integration service encountered an error. The failure itself is not the problem. How the system behaves after the failure is the problem. Staleness propagation — the state where one integrated tool has current knowledge and another has outdated knowledge, with no visible signal that the discrepancy exists — is the most damaging failure mode in knowledge integration.

  1. Delivery Guarantees: Define the delivery guarantee for each push integration: at-most-once (send once, do not retry, accept loss); at-least-once (retry until acknowledged, accept duplicates); or exactly-once (most complex, requiring idempotency on the receiver). For knowledge integrations, the standard pattern is at-least-once delivery with receiver-side idempotency: the sender retries until acknowledged, and the receiver uses idempotency keys to detect and discard duplicate deliveries so no update is applied twice.
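A minimal sketch of the receiver side of this pattern, assuming an in-memory store; the class and method names (`KnowledgeReceiver`, `receive`) are illustrative, not part of any specific integration API:

```python
class KnowledgeReceiver:
    """Applies knowledge updates at most once per idempotency key.

    Pairs with at-least-once delivery: the sender may retry and deliver
    the same update twice, but duplicates are detected and skipped here.
    """

    def __init__(self):
        # Keys already applied. A production receiver would persist this
        # (e.g. a unique-constrained database column), not hold it in memory.
        self._applied: set[str] = set()
        self.documents: dict[str, str] = {}

    def receive(self, idempotency_key: str, doc_id: str, content: str) -> bool:
        """Return True if the update was applied, False if it was a duplicate."""
        if idempotency_key in self._applied:
            return False  # duplicate delivery from a retry: safely ignored
        self.documents[doc_id] = content
        self._applied.add(idempotency_key)
        return True
```

The key design point is that the idempotency key identifies the *update event*, not the document, so a legitimate second edit to the same document (with a new key) still goes through.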
  2. Staleness Signals: Every integrated tool should display the timestamp of the last knowledge sync and flag when that sync is beyond the acceptable staleness threshold. "Knowledge last updated 4 days ago — verify before use" is far more useful than stale knowledge served with no indication that it may be outdated. Staleness signals require the integration to track and expose sync metadata, not just content.
  3. Sync Failure Recovery: When a sync failure is detected, the recovery pattern is: queue the failed update for retry, alert the integration owner, and flag the affected tool as potentially stale. The retry queue processes with exponential backoff; the staleness flag lifts when a retry succeeds; the integration owner investigates if the failure persists beyond the maximum retry window. Recovery is automated; investigation is human.
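The automated half of this pattern can be sketched as a retry loop with exponential backoff; all names here (`deliver`, `alert_owner`, `MAX_ATTEMPTS`) are hypothetical, and a real integration would run this from a persistent queue rather than a single call:

```python
import time

MAX_ATTEMPTS = 5   # assumed maximum retry window
BASE_DELAY_S = 1.0  # backoff doubles from this base: 1s, 2s, 4s, 8s, 16s

def retry_update(deliver, update, alert_owner, sleep=time.sleep) -> bool:
    """Retry delivery with exponential backoff; return True on success.

    While this returns False the affected tool should stay flagged as
    potentially stale; once the retry window is exhausted, the owner is
    alerted and investigation becomes a human task.
    """
    for attempt in range(MAX_ATTEMPTS):
        try:
            deliver(update)
            return True  # success: the staleness flag can be lifted
        except ConnectionError:
            sleep(BASE_DELAY_S * 2 ** attempt)
    alert_owner(update)  # persistent failure: escalate to the integration owner
    return False
```

Injecting `sleep` as a parameter is a small testability choice: tests can pass a no-op instead of actually waiting out the backoff.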