SA-301e · Module 2
Handling Late-Arriving Data
3 min read
In batch processing, all data arrives before processing begins. In stream processing, data arrives whenever it arrives — sometimes late. A transaction that occurred at 2:58 PM may arrive at 3:04 PM, after the 3:00 PM window has closed. The architecture must decide: does the late event update the closed window, get assigned to the next window, or get dropped? Each choice has different correctness and complexity implications.
Do This
- Define a watermark — the maximum expected lateness — and allow window updates until the watermark passes
- Design downstream consumers to handle window result updates — the first result is provisional, not final
- Route events that arrive after the watermark to a late-data store for reconciliation in the next batch cycle
Avoid This
- Assume all events arrive in order — network latency, producer retries, and partition rebalancing guarantee they will not
- Drop late events silently — every dropped event is a data accuracy problem that nobody knows about
- Set watermarks without measuring actual event latency distribution — a watermark that is too short drops valid data, one that is too long delays results