SA-301e · Module 2

Handling Late-Arriving Data

3 min read

In batch processing, all data arrives before processing begins. In stream processing, data arrives whenever it arrives — sometimes late. A transaction that occurred at 2:58 PM may arrive at 3:04 PM, after the 3:00 PM window has closed. The architecture must decide: does the late event update the closed window, get assigned to the next window, or get dropped? Each choice has different correctness and complexity implications.

Do This

  • Define a watermark — the maximum expected lateness — and allow window updates until the watermark passes
  • Design downstream consumers to handle window result updates — the first result is provisional, not final
  • Route events that arrive after the watermark to a late-data store for reconciliation in the next batch cycle

Avoid This

  • Assume all events arrive in order — network latency, producer retries, and partition rebalancing guarantee they will not
  • Drop late events silently — every dropped event is a data accuracy problem that nobody knows about
  • Set watermarks without measuring actual event latency distribution — a watermark that is too short drops valid data, one that is too long delays results