KM-301b · Module 3

Backup, Versioning, and Recovery

Knowledge bases fail differently from applications. A database outage is dramatic and immediately visible. A knowledge base failure is often quiet: a mass accidental deletion that goes unnoticed for days, a corrupted content export that looks complete but is missing 15% of items, a platform migration that silently drops all content older than a certain date. Recovering from these failures requires a backup architecture designed in advance, not one assembled from whatever survived the incident.

Full-Corpus Export on a Schedule

Configure a nightly full-corpus export via the platform API to a storage system you control (S3, GCS, or equivalent). The export must include all content in a structured format (JSON with all schema fields), all taxonomy definitions, all user permission assignments, and all content relationships. Verify the export count nightly: an export that completes but contains 40% fewer items than yesterday requires immediate investigation.
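
The export requirements above can be sketched as a bundle that carries its own manifest. This is a minimal illustration, not a platform API: the section names in `REQUIRED_SECTIONS`, the bundle layout, and both function names are assumptions made for the example, and each section is assumed to be a list of records.

```python
import hashlib
import json

# Sections the export must contain, per the requirements above.
# The key names are hypothetical, not a real platform schema.
REQUIRED_SECTIONS = ("items", "taxonomy", "permissions", "relationships")


def _digest(sections: dict) -> str:
    """Stable SHA-256 over the exported data (keys sorted for determinism)."""
    payload = json.dumps(sections, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_export_bundle(sections: dict) -> dict:
    """Wrap raw export sections with a manifest of counts and a checksum."""
    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError(f"export is missing sections: {missing}")
    manifest = {
        "counts": {name: len(sections[name]) for name in REQUIRED_SECTIONS},
        "sha256": _digest(sections),
    }
    return {"manifest": manifest, "data": sections}


def verify_bundle(bundle: dict) -> bool:
    """Re-check counts and checksum, e.g. after reading the bundle back."""
    manifest, data = bundle["manifest"], bundle["data"]
    counts_ok = all(
        len(data.get(name, [])) == manifest["counts"][name]
        for name in REQUIRED_SECTIONS
    )
    return counts_ok and _digest(data) == manifest["sha256"]
```

Storing the counts next to the data means the nightly count comparison only needs to read two small manifests, and the checksum catches an export that was truncated or altered after it was written.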

Content Versioning at the Item Level

Every content item should have a version history: at minimum, the last 10 versions with timestamps, change author, and the body delta. Platform-native versioning is acceptable if it meets this bar. External versioning (export to git on every significant change) provides a more robust recovery path for content that was edited destructively. The recovery case for item-level versioning: a subject matter expert rewrites a runbook incorrectly and the error is discovered 3 weeks later. You need the previous version, not just the current one.
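
As a rough sketch of the minimum bar described above (last 10 versions, each with a timestamp, author, and delta), the following keeps a bounded in-memory history per item. The `Version` and `VersionedItem` classes are hypothetical names for illustration; a real platform or a git-backed export would persist this history rather than hold it in memory.

```python
import difflib
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Version:
    timestamp: datetime
    author: str
    body: str
    delta: str  # unified diff against the previous version


@dataclass
class VersionedItem:
    item_id: str
    max_versions: int = 10
    versions: deque = field(init=False)

    def __post_init__(self) -> None:
        # deque(maxlen=N) drops the oldest entry once N is exceeded.
        self.versions = deque(maxlen=self.max_versions)

    def save(self, body: str, author: str) -> Version:
        """Record a new version along with its diff from the previous body."""
        previous = self.versions[-1].body if self.versions else ""
        delta = "\n".join(
            difflib.unified_diff(
                previous.splitlines(), body.splitlines(),
                fromfile="previous", tofile="current", lineterm="",
            )
        )
        version = Version(datetime.now(timezone.utc), author, body, delta)
        self.versions.append(version)
        return version

    def rollback(self, steps: int = 1) -> str:
        """Return the body from `steps` versions before the current one."""
        if steps < 1 or steps >= len(self.versions):
            raise IndexError("not enough history to roll back that far")
        return self.versions[-1 - steps].body
```

In the runbook scenario above, `item.rollback(1)` returns the text as it stood before the incorrect rewrite, provided fewer than 10 saves have happened since. That limit is one reason an external history such as git can be the more robust option for heavily edited items.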

Recovery Runbook (Meta: A Runbook for the Knowledge Base)

Document the recovery procedure for each failure class: single item deletion, bulk deletion, taxonomy corruption, platform outage, and platform data loss requiring full restore from backup. Test the restore procedure annually: an untested backup is an undiscovered failure. The recovery runbook should specify who is notified, what commands or UI steps restore the data, how to verify restoration completeness, and how to communicate the incident to affected users.
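
One way to make "verify restoration completeness" concrete is to compare the backup against what the platform holds after the restore. The sketch below assumes each item is a dict with an `id` and optional `title` and `body` fields; those field names and the `verify_restore` helper are illustrative assumptions, not part of any specific platform.

```python
import hashlib


def fingerprint(item: dict) -> str:
    """Hash the fields whose loss would make a restore incomplete."""
    material = "\x1f".join(
        [item["id"], item.get("title", ""), item.get("body", "")]
    )
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def verify_restore(backup_items: list, restored_items: list) -> dict:
    """Compare restored items against the backup they were restored from."""
    backup = {item["id"]: fingerprint(item) for item in backup_items}
    restored = {item["id"]: fingerprint(item) for item in restored_items}
    missing = sorted(set(backup) - set(restored))
    unexpected = sorted(set(restored) - set(backup))
    altered = sorted(
        item_id
        for item_id in backup.keys() & restored.keys()
        if backup[item_id] != restored[item_id]
    )
    return {
        "missing": missing,        # in the backup but not restored
        "altered": altered,        # restored with different content
        "unexpected": unexpected,  # present after restore but not in the backup
        "complete": not missing and not altered,
    }
```

A report like this also gives the incident communication something specific to say: which items were affected and whether any could not be recovered.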

RTO and RPO Definitions

Establish a Recovery Time Objective (how long the knowledge base can be unavailable) and a Recovery Point Objective (how much content change can be lost). For most enterprise knowledge bases: an RTO of 4 hours (knowledge is important but not real-time critical) and an RPO of 24 hours (a day's worth of edits can be reconstructed). A tighter RPO calls for a more frequent export cadence; a tighter RTO calls for a hot standby.
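
The two objectives can be checked against observed data rather than only stated. The sketch below is one possible approach under simple assumptions: the RPO is treated as the maximum allowed gap between consecutive successful exports, and the RTO is compared with the duration measured in a restore test. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def rpo_violations(export_times: list, rpo: timedelta) -> list:
    """Return (earlier, later) pairs of consecutive exports spaced wider than rpo."""
    ordered = sorted(export_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > rpo
    ]


def meets_rto(measured_restore: timedelta, rto: timedelta) -> bool:
    """True when a measured restore finished within the RTO."""
    return measured_restore <= rto


# Example: a skipped nightly export leaves a 48-hour gap.
exports = [
    datetime(2024, 3, 1, 2, 0),
    datetime(2024, 3, 2, 2, 0),
    datetime(2024, 3, 4, 2, 0),
]
gaps = rpo_violations(exports, rpo=timedelta(hours=24))
restore_ok = meets_rto(timedelta(hours=3, minutes=20), rto=timedelta(hours=4))
```

Here `gaps` holds the single pair spanning March 2 to March 4, and `restore_ok` is true because the measured restore took less than 4 hours.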

# Backup verification checklist — run nightly
checks:
  export_completeness:
    description: "Item count in last export matches platform count"
    query: "SELECT COUNT(*) FROM kb_export WHERE date = CURRENT_DATE - INTERVAL '1' DAY"
    expected: "Within 1% of platform API count"
    failure_action: "Page on-call knowledge admin. Do not discard yesterday's backup."

  taxonomy_integrity:
    description: "All taxonomy definitions exported and non-empty"
    check: "exported_taxonomy.facets.length > 0"
    failure_action: "Flag for morning review. Taxonomy export may have failed silently."

  version_history:
    description: "Items with recent edits have version history captured"
    check: "All items modified in last 24h have at least 2 versions in export"
    failure_action: "Verify versioning export pipeline. Alert knowledge admin."

  backup_size_delta:
    description: "Export size is within expected range of previous day"
    check: "abs(today_size - yesterday_size) / yesterday_size < 0.15"
    failure_action: >
      Size delta > 15% indicates either mass deletion or mass addition.
      Verify intentionality before rotating backup. Retain both yesterday and today.

restore_test_schedule:
  frequency: quarterly
  scope: "Restore a random sample of 50 items to a staging environment. Verify content, metadata, and version history are intact."
  owner: knowledge_admin
  documentation: "rc-vault/sessions/kb-restore-test-{date}.md"
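
The four checks in the checklist above could be evaluated along these lines. This is a sketch under stated assumptions: the summary dict keys (`item_count`, `taxonomy_facets`, `recently_modified`, `version_count`, `size_bytes`) are hypothetical, and the thresholds are copied from the checklist (1% for item count, 15% for size delta).

```python
def relative_difference(current: float, baseline: float) -> float:
    """Absolute difference as a fraction of the baseline."""
    if baseline == 0:
        # No baseline to compare against: only an equally empty value matches.
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / baseline


def run_nightly_checks(today: dict, yesterday: dict, platform_count: int) -> list:
    """Return the names of failed checks; an empty list means all passed."""
    failures = []
    # export_completeness: item count within 1% of the platform API count.
    if relative_difference(today["item_count"], platform_count) > 0.01:
        failures.append("export_completeness")
    # taxonomy_integrity: at least one facet was exported.
    if len(today["taxonomy_facets"]) == 0:
        failures.append("taxonomy_integrity")
    # version_history: recently edited items carry at least 2 versions.
    if any(item["version_count"] < 2 for item in today["recently_modified"]):
        failures.append("version_history")
    # backup_size_delta: size changed by less than 15% since yesterday.
    if relative_difference(today["size_bytes"], yesterday["size_bytes"]) >= 0.15:
        failures.append("backup_size_delta")
    return failures
```

Returning the failed check names, rather than raising on the first failure, lets the caller apply each check's own `failure_action`, such as paging for `export_completeness` but only flagging `taxonomy_integrity` for morning review.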