MP-301i · Module 3

Configuration Drift Detection

3 min read

Configuration drift occurs when MCP server instances in the same fleet diverge in their configuration — different environment variables, different feature flags, different timeout values, different tool lists. Drift happens gradually: an operator hot-patches one instance during an incident, a deployment partially fails leaving some instances on the old config, or an A/B test flag is set on some instances and forgotten. The result is inconsistent behavior: the same tool call succeeds on one instance and fails on another, depending on which instance the load balancer chose.

Drift detection requires a canonical configuration source and a comparison mechanism. Store the desired configuration in a version-controlled config repository or a configuration management service (Consul, etcd, AWS AppConfig). Periodically query each running instance for its effective configuration (via a /config endpoint or management API) and compare against the canonical version. Flag any instance where the effective configuration does not match the canonical source. Automate the comparison — manual spot-checks catch drift eventually, but automated continuous comparison catches it within minutes.
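The comparison step described above might be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: it assumes each instance's effective configuration has already been fetched and parsed into a flat dictionary, and the names `diff_config`, `find_drifted`, and `MISSING` are hypothetical, not part of any MCP tooling.

```python
from typing import Any

# Hypothetical placeholder for a key that is absent on one side of the comparison.
MISSING = "<missing>"


def diff_config(canonical: dict[str, Any], effective: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (expected, actual)} for every key whose value differs."""
    drift = {}
    # Walk the union of keys so that both missing and extra settings are reported.
    for key in sorted(canonical.keys() | effective.keys()):
        expected = canonical.get(key, MISSING)
        actual = effective.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def find_drifted(canonical: dict[str, Any], fleet: dict[str, dict[str, Any]]) -> dict[str, dict[str, tuple[Any, Any]]]:
    """Map each drifted instance ID to its differences; matching instances are omitted."""
    report = {}
    for instance_id, effective in fleet.items():
        drift = diff_config(canonical, effective)
        if drift:
            report[instance_id] = drift
    return report
```

In a scheduled job, `fleet` would be built by querying each instance's config endpoint, and a non-empty result from `find_drifted` would trigger an alert. Reporting the expected and actual value per key, rather than a bare mismatch flag, tells the operator what was changed.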

Drift remediation has two modes: convergence and replacement. Convergence updates the drifted instance's configuration to match the canonical source: push the correct config and reload. Replacement terminates the drifted instance and launches a new one from the canonical configuration. Replacement is safer because it eliminates any state that may have accumulated alongside the drift (leaked connections, cached stale data, modified files). In containerized deployments, replacement is the default: kill the drifted container, and the orchestrator launches a new one from the canonical image.
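The choice between the two modes can be expressed as a small planning function. The sketch below assumes a simple rule: replace any instance that an orchestrator will relaunch from canonical configuration, and fall back to convergence otherwise. The `Instance` type, its `orchestrated` flag, and `plan_remediation` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    REPLACE = "replace"    # terminate the instance and launch a fresh one
    CONVERGE = "converge"  # push the canonical config and reload in place


@dataclass(frozen=True)
class Instance:
    instance_id: str
    # Assumption: True when an orchestrator will relaunch the instance
    # from canonical configuration after it is terminated.
    orchestrated: bool


def plan_remediation(drifted: list[Instance]) -> dict[str, Action]:
    """Prefer replacement; converge only where relaunch is not automated."""
    plan = {}
    for inst in drifted:
        plan[inst.instance_id] = Action.REPLACE if inst.orchestrated else Action.CONVERGE
    return plan
```

The plan only records the decision. Terminating containers or pushing configuration would go through whatever orchestrator or management API the fleet already uses.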

Do This

Store canonical configuration in version control or a configuration management service
Expose a /config or /status endpoint on each instance for automated drift detection
Run drift checks automatically on a short fixed interval (for example, every 5 minutes) and alert on any divergence
Prefer replacement over convergence — new instances from canonical config have no residual state
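One way to make the endpoint cheap to compare is to have each instance report a fingerprint of its effective configuration alongside the values themselves. The sketch below is an assumption about how that could look, not a prescribed format: `fingerprint` and `config_status` are hypothetical helpers, and the payload shape is illustrative.

```python
import hashlib
import json
from typing import Any


def fingerprint(config: dict[str, Any]) -> str:
    """Hash a config so that key order does not affect the result."""
    # sort_keys gives a stable serialization; compact separators avoid
    # whitespace differences between producers.
    serialized = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


def config_status(config: dict[str, Any]) -> dict[str, Any]:
    """Build the body a /config or /status endpoint might return."""
    return {"fingerprint": fingerprint(config), "config": config}
```

A drift checker can compare fingerprints first and fetch or diff the full configuration only for instances whose fingerprint differs from the canonical one.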

Avoid This

Hot-patch individual instances during incidents without updating the canonical config
Rely on manual inspection to detect drift; it tends to surface drift only once an incident is already underway
Allow operators to set environment variables directly on instances, bypassing the config source
Ignore drift on non-production environments; staging drift leads to "works in staging, fails in production" bugs