MP-301i · Module 3
Configuration Drift Detection
3 min read
Configuration drift occurs when MCP server instances in the same fleet diverge in their configuration — different environment variables, different feature flags, different timeout values, different tool lists. Drift happens gradually: an operator hot-patches one instance during an incident, a deployment partially fails leaving some instances on the old config, or an A/B test flag is set on some instances and forgotten. The result is inconsistent behavior: the same tool call succeeds on one instance and fails on another, depending on which instance the load balancer chose.
Drift detection requires a canonical configuration source and a comparison mechanism. Store the desired configuration in a version-controlled config repository or a configuration management service (Consul, etcd, AWS AppConfig). Periodically query each running instance for its effective configuration (via a /config endpoint or management API) and compare against the canonical version. Flag any instance where the effective configuration does not match the canonical source. Automate the comparison — manual spot-checks catch drift eventually, but automated continuous comparison catches it within minutes.
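The comparison step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes each instance serves its effective configuration as a flat JSON object at a `/config` path, and the names `fetch_effective_config`, `diff_config`, `find_drifted_instances`, and `ABSENT` are hypothetical helpers introduced here.

```python
import json
import urllib.request
from typing import Any

# Hypothetical marker for a key that exists on only one side of the comparison.
ABSENT = "<absent>"


def fetch_effective_config(base_url: str, timeout: float = 5.0) -> dict[str, Any]:
    """Fetch an instance's effective config (assumes a JSON /config endpoint)."""
    with urllib.request.urlopen(f"{base_url}/config", timeout=timeout) as resp:
        return json.load(resp)


def diff_config(canonical: dict[str, Any], effective: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (canonical_value, effective_value)} for every differing key."""
    drift = {}
    for key in canonical.keys() | effective.keys():
        want = canonical.get(key, ABSENT)
        have = effective.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


def find_drifted_instances(
    canonical: dict[str, Any], fleet: dict[str, dict[str, Any]]
) -> dict[str, dict[str, tuple[Any, Any]]]:
    """Map each drifted instance name to its per-key differences."""
    report = {}
    for name, effective in fleet.items():
        drift = diff_config(canonical, effective)
        if drift:
            report[name] = drift
    return report
```

In a scheduled job, `fleet` would be built by calling `fetch_effective_config` for each instance, and a non-empty report would raise an alert. Comparing the union of keys matters: an extra variable set by hand on one instance is drift even though the canonical source never mentions it.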
Drift remediation has two modes: convergence and replacement. Convergence updates the drifted instance's configuration to match the canonical source: push the correct config and reload. Replacement terminates the drifted instance and launches a new one from the canonical configuration. Replacement is safer because it eliminates any state that may have accumulated alongside the drift (leaked connections, cached stale data, modified files). In containerized deployments, replacement is the default: kill the drifted container, and the orchestrator launches a new one from the canonical image.
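The two remediation modes can be sketched as a single dispatch function. This is an illustrative outline under stated assumptions: `FleetOps` and its four callables (`push_config`, `reload`, `terminate`, `launch`) are hypothetical hooks standing in for whatever your orchestrator or management API actually provides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Mode(Enum):
    CONVERGE = "converge"
    REPLACE = "replace"


@dataclass
class FleetOps:
    """Hypothetical hooks into the platform that manages the fleet."""
    push_config: Callable[[str, dict[str, Any]], None]
    reload: Callable[[str], None]
    terminate: Callable[[str], None]
    launch: Callable[[dict[str, Any]], str]  # returns the new instance id


def remediate(
    instance_id: str,
    canonical: dict[str, Any],
    ops: FleetOps,
    mode: Mode = Mode.REPLACE,
) -> str:
    """Remediate one drifted instance and return the id now serving traffic."""
    if mode is Mode.CONVERGE:
        # Convergence: fix the config in place; accumulated state survives.
        ops.push_config(instance_id, canonical)
        ops.reload(instance_id)
        return instance_id
    # Replacement: discard the instance along with any residual state.
    ops.terminate(instance_id)
    return ops.launch(canonical)
```

Replacement is the default argument here to mirror the preference stated above. Under an orchestrator that restarts terminated containers on its own, the explicit `launch` call would likely be unnecessary.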
Do This
- Store canonical configuration in version control or a configuration management service
- Expose a /config or /status endpoint on each instance for automated drift detection
- Run drift checks on a short, fixed interval (for example, every 5 minutes) and alert on any divergence
- Prefer replacement over convergence — new instances from canonical config have no residual state
Avoid This
- Hot-patch individual instances during incidents without updating the canonical config
- Rely on manual inspection to detect drift — it only catches drift during incidents
- Allow operators to set environment variables directly on instances bypassing the config source
- Ignore drift in non-production environments: staging drift leads to "works in staging" bugs that only surface in production