OC-301c · Module 3
Memory Failure & Recovery
Memory systems fail. The store becomes corrupted. The indexing service goes down. A bad batch write contaminates a pool with incorrect data. An agent retrieves a memory from the wrong context and acts on it inappropriately. Memory failure recovery is the set of procedures that detect, contain, and remediate memory failures before they cascade into operational failures.
The failure taxonomy has four categories, each with its own recovery procedure:
- Corruption: the memory store returns malformed data. Recovery: restore from the most recent backup and replay writes from the transaction log.
- Contamination: incorrect data was written to a shared pool. Recovery: identify the contaminated entries, flag all agent decisions made using those entries, quarantine the entries, and notify affected agents to re-evaluate.
- Index failure: the retrieval system cannot find memories that exist. Recovery: rebuild the index from the raw store.
- Misattribution: a memory was retrieved in the wrong context. Recovery: improve the retrieval relevance scoring and add context guards to the memory query.
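The contamination recovery has the most moving parts, so it can help to see it as code. The following is a minimal sketch, not a reference implementation. It assumes each memory entry records the batch that wrote it and that a decision log records which entries each agent decision read. The names `MemoryEntry`, `Decision`, and `recover_from_contamination` are hypothetical and do not come from any particular memory system.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    entry_id: str
    batch_id: str            # assumed: every entry records the batch that wrote it
    content: str
    quarantined: bool = False


@dataclass
class Decision:
    decision_id: str
    agent_id: str
    entry_ids: list          # assumed: ids of the memory entries this decision read
    needs_review: bool = False


def recover_from_contamination(pool, decisions, bad_batch_id):
    """Quarantine entries from a bad batch and flag decisions that used them.

    Returns the sorted ids of agents that should be asked to re-evaluate.
    """
    # 1. Identify the contaminated entries.
    contaminated = {e.entry_id for e in pool if e.batch_id == bad_batch_id}

    # 2. Flag every decision that read at least one contaminated entry.
    affected_agents = set()
    for decision in decisions:
        if contaminated.intersection(decision.entry_ids):
            decision.needs_review = True
            affected_agents.add(decision.agent_id)

    # 3. Quarantine the entries so later retrievals can skip them.
    for entry in pool:
        if entry.entry_id in contaminated:
            entry.quarantined = True

    # 4. Notification is left to the caller, which receives the agent ids.
    return sorted(affected_agents)
```

Flagging decisions before quarantining keeps the audit trail intact: the decision log still points at the entries that were in use at the time, even though those entries are no longer retrievable.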
Do This
- Maintain transaction logs for all memory writes — they are your replay capability after corruption
- Back up the memory store daily with point-in-time recovery capability
- Test recovery procedures quarterly — an untested recovery procedure is an assumption, not a capability
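The first two practices combine into a restore-then-replay procedure, and the third is easier to follow when that procedure is a function a test can call. Here is a minimal sketch, assuming an in-memory key-value store, a backup that records the last log sequence number it includes, and a log of write records. `WriteRecord` and `restore_and_replay` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WriteRecord:
    seq: int                 # assumed: monotonically increasing log sequence number
    key: str
    value: Optional[str]     # None marks a delete


def restore_and_replay(backup, backup_seq, log):
    """Rebuild the store from a backup plus the writes logged after it."""
    store = dict(backup)     # copy, so the backup itself is never mutated
    for record in sorted(log, key=lambda r: r.seq):
        if record.seq <= backup_seq:
            continue         # already reflected in the backup
        if record.value is None:
            store.pop(record.key, None)
        else:
            store[record.key] = record.value
    return store
```

A quarterly recovery test could call `restore_and_replay` against a copy of production data and compare the result with the live store. Any difference means either the backup or the log is incomplete.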
Avoid This
- Assume the memory store is reliable — every persistence layer fails eventually
- Detect contamination only when an agent produces wrong output — by then the contamination has spread
- Rebuild from scratch when recovery fails — you lose institutional knowledge that took months to accumulate