CDX-301d · Module 1

Snapshot/Restore & Memory Management

Firecracker supports full VM snapshots — freezing the entire state of a running microVM (memory contents, CPU registers, device state) to disk and restoring it later. This enables Codex Cloud to pre-warm sandboxes: boot a VM, clone the repository, install dependencies, load AGENTS.md, then snapshot the result. When a task arrives, the system restores from the snapshot instead of booting from scratch, reducing startup time from 15-60 seconds to under 2 seconds.
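
The snapshot and restore steps map onto a small number of Firecracker API requests. The sketch below only builds the request bodies and sends nothing. It is based on Firecracker's public snapshot API (`PATCH /vm`, `PUT /snapshot/create`, `PUT /snapshot/load`); field names can differ between Firecracker releases, the file paths are hypothetical, and nothing here is confirmed as how Codex Cloud drives its VMs.

```python
import json


def snapshot_create_requests(snapshot_path: str, mem_path: str):
    """Requests that pause a running microVM and write a full snapshot."""
    return [
        # The VM must be paused before its state can be captured.
        ("PATCH", "/vm", {"state": "Paused"}),
        ("PUT", "/snapshot/create", {
            "snapshot_type": "Full",
            "snapshot_path": snapshot_path,  # device and CPU state
            "mem_file_path": mem_path,       # guest memory contents
        }),
    ]


def snapshot_restore_requests(snapshot_path: str, mem_path: str):
    """Request that loads a snapshot into a fresh Firecracker process."""
    return [
        ("PUT", "/snapshot/load", {
            "snapshot_path": snapshot_path,
            "mem_backend": {"backend_type": "File", "backend_path": mem_path},
            "resume_vm": True,  # resume immediately after loading
        }),
    ]


# Hypothetical paths, for illustration only.
steps = snapshot_create_requests("/snapshots/base.state", "/snapshots/base.mem")
steps += snapshot_restore_requests("/snapshots/base.state", "/snapshots/base.mem")
for method, path, body in steps:
    print(method, path, json.dumps(body))
```

Each tuple would be sent to the microVM's API socket. Setting `resume_vm` on load avoids a separate resume request after the restore.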

Memory management inside a Firecracker microVM relies on demand paging and a balloon device. The hypervisor sets a maximum memory limit for the guest, but host pages are committed only when the guest first touches them. If a task uses 6 GB in an 8 GB VM, about 6 GB of host memory is consumed. When the VM is destroyed, all of its pages are reclaimed immediately. There is no swap by default: if a task exceeds its memory limit, the guest's OOM killer terminates the process. This hard boundary is intentional. With swap, an oversized task would run slowly instead of failing fast, which makes resource overcommitment harder to detect.
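
The accounting described above can be illustrated with a toy model. This is not Firecracker code: the class and exception names are hypothetical, and it only tracks numbers to show how the limit, on-demand backing, OOM, and reclamation relate.

```python
class GuestOOM(Exception):
    """Raised by the toy model where a real guest would OOM-kill the process."""


class MicroVMMemoryModel:
    """Toy accounting: host memory is consumed only for pages the guest touches."""

    def __init__(self, limit_mib: int):
        self.limit_mib = limit_mib
        self.touched_mib = 0
        self.alive = True

    def touch(self, mib: int) -> None:
        """Simulate a task touching `mib` of new memory."""
        if self.touched_mib + mib > self.limit_mib:
            # No swap: exceeding the limit fails immediately.
            raise GuestOOM(f"needs {self.touched_mib + mib} MiB, limit is {self.limit_mib} MiB")
        self.touched_mib += mib

    @property
    def host_usage_mib(self) -> int:
        """Host memory backing this VM; zero once the VM is destroyed."""
        return self.touched_mib if self.alive else 0

    def destroy(self) -> None:
        self.alive = False


vm = MicroVMMemoryModel(limit_mib=8192)   # 8 GB limit
vm.touch(6144)                            # task uses 6 GB
print("host usage:", vm.host_usage_mib, "MiB")
try:
    vm.touch(3072)                        # 9 GB total exceeds the limit
except GuestOOM as err:
    print("OOM:", err)
vm.destroy()
print("after destroy:", vm.host_usage_mib, "MiB")
```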

# Snapshot lifecycle

1. Base VM boots → repo cloned → deps installed → AGENTS.md loaded
2. VM state frozen → memory pages + CPU state written to snapshot file
3. Snapshot stored (typically 200-800 MB compressed)
4. Task arrives → snapshot restored → VM resumes in <2 seconds
5. Task executes → diff extracted → VM destroyed

# Memory allocation model

Max allocation:    8 GB (configurable per tier)
Physical backing:  On-demand (demand paging; balloon device)
Swap:              None (OOM kill on exceed)
Overcommit:        Host-level only (not guest-visible)
Reclamation:       Instant on VM destroy

# OOM behavior
- Process exceeds limit → OOM killer fires
- Task fails with clear error → no silent degradation
- Logs capture peak memory usage for debugging
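
The source does not say how peak memory is recorded in the logs. As one possible approach, a task can report its own peak resident set size at exit. The sketch below assumes a Unix guest and uses Python's standard `resource` module; the helper name `peak_rss_mib` is hypothetical.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Do some work, then log the high-water mark for later right-sizing.
working_set = [bytes(1024) for _ in range(10_000)]
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

This covers only the calling process. Memory used by child processes would need `resource.RUSAGE_CHILDREN` or VM-level metrics.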

Do This

Monitor peak memory usage in task logs to right-size your VM memory allocation
Use snapshots for frequently executed task patterns — the amortized boot time approaches zero
Design tasks to fail fast on OOM rather than degrading silently with swap
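
The amortization claim can be checked with simple arithmetic. The sketch below assumes a one-time snapshot build cost of 60 seconds (the upper end of the cold-boot range quoted earlier) and a 2-second restore; both figures are illustrative. The share of the build cost per task shrinks toward zero, while each task still pays the restore time.

```python
def amortized_startup_seconds(build_seconds: float, restore_seconds: float, tasks: int) -> float:
    """Average startup cost per task when one snapshot serves `tasks` restores."""
    if tasks < 1:
        raise ValueError("tasks must be at least 1")
    return build_seconds / tasks + restore_seconds


# Assumed figures: 60 s to build the snapshot once, 2 s per restore.
for n in (1, 10, 100, 1000):
    print(n, round(amortized_startup_seconds(60.0, 2.0, n), 2))
```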

Avoid This

Ignoring OOM errors: they indicate your task needs more memory or a smaller working set
Assuming snapshots are always fresh: dependency updates require snapshot regeneration
Over-allocating memory "just in case": unused allocations still reserve host resources in warm pools
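
One way to avoid stale snapshots is to store a fingerprint of the dependency manifests alongside each snapshot and regenerate when it changes. The source does not describe such a mechanism, so the sketch below is an assumption throughout: the function names and the choice of manifest files are hypothetical.

```python
import hashlib


def dependency_fingerprint(manifests: dict[str, bytes]) -> str:
    """Stable SHA-256 digest over manifest names and contents."""
    digest = hashlib.sha256()
    for name in sorted(manifests):  # sort so ordering does not affect the result
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(manifests[name])
        digest.update(b"\0")
    return digest.hexdigest()


def snapshot_is_stale(snapshot_fingerprint: str, manifests: dict[str, bytes]) -> bool:
    """True when current manifests differ from those the snapshot was built with."""
    return snapshot_fingerprint != dependency_fingerprint(manifests)


# Hypothetical manifest contents captured when the snapshot was built.
built_with = {"requirements.txt": b"requests==2.31.0\n"}
recorded = dependency_fingerprint(built_with)

current = {"requirements.txt": b"requests==2.32.0\n"}
print("regenerate snapshot:", snapshot_is_stale(recorded, current))
```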