MP-301i · Module 1

SLO / SLI / SLA Design

Service Level Indicators (SLIs) are the metrics you measure. For MCP servers, the critical SLIs are availability (percentage of time the server accepts connections), latency (p50, p95, and p99 of tool invocation duration), error rate (percentage of tool invocations that return errors), and throughput (requests per second the server handles without degradation). Each SLI needs a precise measurement method. For example: availability measured by synthetic health checks every 30 seconds; latency measured from JSON-RPC request receipt to response send; error rate counted over server-side failures only, excluding client-caused errors (HTTP 4xx responses, or JSON-RPC errors such as invalid params); throughput measured at the load balancer.
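As a rough illustration of how these measurement definitions turn into numbers, the sketch below computes latency percentiles, error rate, and availability from recorded samples. The `Invocation` record, its field names, and the choice to keep client-caused errors in the denominator are assumptions made for the example, not part of any MCP SDK.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """One tool invocation (hypothetical record shape)."""
    duration_s: float           # JSON-RPC request receipt to response send
    ok: bool                    # True if the invocation succeeded
    client_error: bool = False  # True if the failure was caused by the caller


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(invocations):
    """Share of invocations that failed for server-side reasons.

    Assumption: client-caused errors count as non-failures but stay in
    the denominator.
    """
    if not invocations:
        return 0.0
    server_errors = sum(1 for i in invocations if not i.ok and not i.client_error)
    return server_errors / len(invocations)


def availability(probe_results):
    """Share of synthetic health checks (True = success) that passed."""
    if not probe_results:
        raise ValueError("no probes")
    return sum(1 for p in probe_results if p) / len(probe_results)


samples = [
    Invocation(0.1, True),
    Invocation(0.2, True),
    Invocation(0.3, False, client_error=True),  # excluded from server errors
    Invocation(4.0, False),                     # counted as a server error
]
durations = [s.duration_s for s in samples]
p50, p95 = percentile(durations, 50), percentile(durations, 95)
```

With only four samples the p95 is simply the slowest call; real percentile SLIs need far more data per window to be meaningful.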

Service Level Objectives (SLOs) are targets set on SLIs. An SLO says: "99.9% of tool invocations complete within 5 seconds, measured over a rolling 30-day window." The SLO defines the error budget: the allowable amount of failure. For a request-based SLO like this one, the budget is 0.1% of invocations in the window. For a time-based availability SLO, 99.9% over 30 days allows 43.2 minutes of downtime (0.1% of 43,200 minutes). When the error budget is being consumed faster than expected, you slow down deployments and prioritize reliability. When the budget is healthy, you can ship faster and accept more risk. Error budgets turn reliability into a quantitative trade-off, not a subjective argument.
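The error budget arithmetic can be checked with a few lines of code. This is a minimal sketch; the function names are illustrative, and `Fraction` is used only so that values like 0.999 do not pick up floating-point rounding.

```python
import math
from fractions import Fraction


def _budget_fraction(slo: float) -> Fraction:
    """Allowed failure share, e.g. 0.999 -> 1/1000."""
    return 1 - Fraction(str(slo))


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime a time-based SLO permits over the window."""
    return float(_budget_fraction(slo) * window_days * 24 * 60)


def allowed_failed_requests(slo: float, total_requests: int) -> int:
    """Failed requests a request-based SLO permits for the given volume."""
    return math.floor(_budget_fraction(slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed = _budget_fraction(slo) * total_requests
    if allowed == 0:
        raise ValueError("no error budget for this SLO and request volume")
    return float(1 - Fraction(failed_requests) / allowed)


downtime_999 = allowed_downtime_minutes(0.999)    # 43.2 minutes over 30 days
downtime_9999 = allowed_downtime_minutes(0.9999)  # 4.32 minutes over 30 days
```

For one million invocations in the window, a 99.9% SLO permits 1,000 failures; after 250 failures, three quarters of the budget remains.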

Service Level Agreements (SLAs) are contractual commitments to customers, backed by financial consequences (credits, refunds). SLAs should always be less stringent than your internal SLOs: if your SLO is 99.9%, your SLA might be 99.5%. This gives you a buffer. You can breach your SLO (triggering an internal response) without breaching your SLA (triggering financial penalties). Never set an SLA without a proven track record of meeting the corresponding SLO. An SLA you cannot meet is a liability, not a commitment.
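The buffer between the two targets can be made concrete with a small sketch. The 99.9% and 99.5% defaults come from the example above; the function names and status strings are hypothetical.

```python
def classify(measured: float, slo: float = 0.999, sla: float = 0.995) -> str:
    """Place a measured availability relative to the SLO and SLA targets."""
    if sla >= slo:
        raise ValueError("SLA must be less stringent than the SLO")
    if measured >= slo:
        return "within SLO"
    if measured >= sla:
        return "SLO breached, SLA intact"  # internal response, no penalties
    return "SLA breached"                  # contractual penalties apply


def buffer_minutes(slo: float = 0.999, sla: float = 0.995, window_days: int = 30) -> float:
    """Extra downtime the SLA tolerates beyond what the SLO tolerates."""
    return (slo - sla) * window_days * 24 * 60


status = classify(0.997)
buffer = buffer_minutes()  # about 172.8 minutes over 30 days
```

With these targets the SLO permits 43.2 minutes of downtime in 30 days and the SLA permits 216, so roughly 172.8 minutes separate an internal incident from a contractual one.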

Do This

Define SLIs with precise measurement methods — "availability" means nothing without specifying how it is measured
Set SLOs that are achievable based on historical data, not aspirational targets
Keep SLAs less stringent than SLOs — the buffer protects you from financial penalties
Track error budget burn rate and alert when it predicts an SLO breach
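Burn-rate tracking can be sketched as follows. Burn rate is the observed failure rate divided by the rate the SLO allows, so a value of 1.0 spends the budget exactly over the full window. The default threshold of 14.4 is an assumption: it is a commonly cited fast-burn value that spends 2% of a 30-day budget in one hour. All function names are illustrative.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed failure rate relative to the failure rate the SLO allows."""
    if total == 0:
        return 0.0
    return (failed / total) / (1 - slo)


def hours_until_exhausted(rate: float, window_days: int = 30,
                          budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the current rate persists."""
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_days * 24 / rate


def should_alert(rate: float, threshold: float = 14.4) -> bool:
    """True when the budget is burning fast enough to warrant paging."""
    return rate >= threshold


# Last hour: 20 failures out of 1,000 invocations against a 99.9% SLO.
rate = burn_rate(failed=20, total=1000, slo=0.999)  # roughly 20x the allowed rate
hours_left = hours_until_exhausted(rate)            # roughly 36 hours
page = should_alert(rate)
```

A production setup would usually evaluate more than one window (for example a short and a long one) so that brief spikes do not page anyone, while sustained burns still do.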

Avoid This

Set a 99.99% SLO without understanding that it allows only 4.3 minutes of downtime per month
Define SLOs without dashboards and alerts — an unmeasured SLO is a wish, not an objective
Set SLAs equal to SLOs — you will breach the SLA the first time something goes wrong
Measure availability only from the server's perspective; also measure it from the client's perspective, where network and load balancer failures are visible