SA-201b · Module 1
API Patterns for AI Systems
AI systems have API requirements that traditional applications do not. Model inference is computationally expensive, and its latency varies from request to request. Responses are often streamed rather than returned in a single payload. Outputs may carry confidence scores. Rate limits tend to be tighter because each call consumes significant compute. Designing APIs for AI systems requires patterns that account for these characteristics.
- Streaming Responses: AI models, especially large language models, generate their output token by token. Server-Sent Events (SSE) or WebSockets let the consumer begin processing output before the full response has been generated. Perceived latency drops dramatically: a 10-second generation whose first token arrives within about 200 ms feels responsive almost immediately, even though the total generation time is unchanged.
- Asynchronous Processing: For expensive operations such as document processing, batch inference, and fine-tuning, the synchronous request-response pattern is inappropriate. Instead, accept the job, return a job ID immediately, and provide a status endpoint or a callback webhook. The consumer then polls for completion or receives a notification. This pattern prevents timeout cascades in the client and allows the backend to queue and prioritize work.
- Confidence and Metadata: AI outputs should include confidence scores, model version, processing time, and token usage alongside the result. This metadata lets the consumer make informed decisions: routing low-confidence outputs to human review, tracking cost per request, and detecting model version drift. For these decisions, the metadata is as valuable as the result itself.
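To make the streaming pattern concrete, here is a minimal sketch of the SSE framing on both sides of the connection, using only the Python standard library. The names `generate_tokens`, `sse_stream`, and `read_sse`, the `token` and `done` event names, and the JSON payload shape are all assumptions made for illustration; a real service would send these frames over an HTTP response with the `text/event-stream` content type.

```python
import json
from typing import Iterable, Iterator


def generate_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that yields tokens one at a time."""
    for word in f"Echo: {prompt}".split(" "):
        yield word + " "


def sse_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame as soon as it is produced."""
    for index, token in enumerate(tokens):
        # json.dumps escapes newlines, so each payload stays on one data line.
        payload = json.dumps({"index": index, "token": token})
        yield f"event: token\ndata: {payload}\n\n"
    # A final frame (assumed convention) tells the consumer generation is complete.
    yield "event: done\ndata: {}\n\n"


def read_sse(frames: Iterable[str]) -> Iterator[str]:
    """Consumer side: yield token text frame by frame, stopping at 'done'."""
    for frame in frames:
        fields = dict(line.split(": ", 1) for line in frame.strip().split("\n"))
        if fields["event"] == "done":
            return
        yield json.loads(fields["data"])["token"]


if __name__ == "__main__":
    # The consumer can render each token as it arrives instead of waiting
    # for the whole generation to finish.
    for text in read_sse(sse_stream(generate_tokens("hello world"))):
        print(text, end="", flush=True)
    print()
```

Because both sides are generators, nothing is buffered: the first frame is available to the consumer as soon as the first token exists, which is what produces the drop in perceived latency.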
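The submit-then-poll flow for asynchronous processing might look like the following sketch. It is an in-memory illustration only: `JobService`, its `submit` and `status` methods, and the status names (`queued`, `running`, `succeeded`, `failed`) are hypothetical, and a production service would persist jobs and expose these operations as HTTP endpoints (for example, returning 202 Accepted with the job ID).

```python
import queue
import threading
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass
class Job:
    job_id: str
    status: str = "queued"  # queued -> running -> succeeded | failed
    result: Optional[Any] = None
    error: Optional[str] = None


class JobService:
    """Hypothetical in-memory job service; a real one would persist jobs."""

    def __init__(self, handler: Callable[[Any], Any]) -> None:
        self._handler = handler
        self._jobs: Dict[str, Job] = {}
        self._pending: queue.Queue = queue.Queue()
        self._lock = threading.Lock()
        # One background worker drains the queue in submission order.
        threading.Thread(target=self._work, daemon=True).start()

    def submit(self, payload: Any) -> str:
        """Like POST /jobs: enqueue the work and return a job ID immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = Job(job_id)
        self._pending.put((job_id, payload))
        return job_id

    def status(self, job_id: str) -> Dict[str, Any]:
        """Like GET /jobs/{job_id}: a snapshot the consumer can poll."""
        with self._lock:
            job = self._jobs[job_id]
            return {"job_id": job.job_id, "status": job.status,
                    "result": job.result, "error": job.error}

    def _work(self) -> None:
        while True:
            job_id, payload = self._pending.get()
            self._update(job_id, status="running")
            try:
                result = self._handler(payload)  # the expensive operation
                self._update(job_id, status="succeeded", result=result)
            except Exception as exc:  # record the failure, keep the worker alive
                self._update(job_id, status="failed", error=str(exc))

    def _update(self, job_id: str, **changes: Any) -> None:
        with self._lock:
            job = self._jobs[job_id]
            for name, value in changes.items():
                setattr(job, name, value)


if __name__ == "__main__":
    service = JobService(handler=lambda text: text.upper())
    job_id = service.submit("summarize this document")
    while service.status(job_id)["status"] in ("queued", "running"):
        time.sleep(0.01)  # the consumer polls instead of holding a connection open
    print(service.status(job_id))
```

The call to `submit` returns before any work is done, so the client never holds a connection open for the duration of the job, and the queue gives the backend a single place to order or prioritize work. A webhook variant would invoke a consumer-supplied callback URL from `_work` instead of waiting to be polled.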
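One way to carry confidence and metadata alongside the result is a response envelope like the sketch below. The field names, the 0.8 review threshold, the example model version string, and the per-1,000-token prices are all illustrative assumptions, not recommended values.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class TokenUsage:
    prompt_tokens: int
    completion_tokens: int


@dataclass(frozen=True)
class AIResponse:
    """Hypothetical envelope: the result plus the metadata around it."""
    result: str
    confidence: float  # assumed to lie in [0.0, 1.0]
    model_version: str
    processing_ms: int
    usage: TokenUsage

    def to_dict(self) -> Dict[str, Any]:
        """Serializable form; nested dataclasses become nested dicts."""
        return asdict(self)


def route(response: AIResponse, review_threshold: float = 0.8) -> str:
    """Send low-confidence outputs to human review (threshold is an assumption)."""
    if response.confidence < review_threshold:
        return "human_review"
    return "auto_accept"


def estimate_cost(usage: TokenUsage,
                  price_per_1k_prompt: float,
                  price_per_1k_completion: float) -> float:
    """Cost per request from token usage; prices are supplied by the caller."""
    return (usage.prompt_tokens / 1000 * price_per_1k_prompt
            + usage.completion_tokens / 1000 * price_per_1k_completion)


if __name__ == "__main__":
    response = AIResponse(
        result="Paris",
        confidence=0.62,
        model_version="example-model-2024-01",  # made-up version string
        processing_ms=840,
        usage=TokenUsage(prompt_tokens=120, completion_tokens=30),
    )
    print(response.to_dict())
    print(route(response))
    print(estimate_cost(response.usage, 0.5, 1.5))
```

With the envelope in place, each of the consumer decisions named above becomes a small function of the metadata: `route` reads `confidence`, `estimate_cost` reads `usage`, and comparing `model_version` across responses is enough to notice when the model behind the API has changed.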
Do This
- Stream responses for any AI operation with generation time over 2 seconds
- Use async patterns for batch operations and return job IDs instead of blocking
- Include confidence scores, model version, and token usage in every AI response
Avoid This
- Force synchronous request-response for 30-second AI operations: many clients and intermediaries will time out before the response arrives
- Return AI outputs without confidence indicators — the consumer cannot make quality decisions without them
- Expose your internal AI pipeline structure through the API — abstract the complexity behind a clean interface
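As a sketch of the last point, the function below maps a hypothetical internal pipeline record onto a small public response. The stage names (`retriever`, `reranker`, `generator`), their fields, and the choice to report the reranker score as the public confidence are all invented for illustration.

```python
from typing import Any, Dict

# Hypothetical internal record; stage names and fields are made up.
INTERNAL_RESULT: Dict[str, Dict[str, Any]] = {
    "retriever": {"index": "docs-v7", "hits": 12},
    "reranker": {"model": "rr-small", "top_score": 0.91},
    "generator": {"text": "Refunds take 5 days.", "model": "gen-2024-05", "tokens": 42},
}


def to_public_response(internal: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Expose only stable, consumer-facing fields, never the stage layout."""
    generator = internal["generator"]
    return {
        "result": generator["text"],
        # Assumption: the reranker score stands in for a confidence value.
        "confidence": internal["reranker"]["top_score"],
        "model_version": generator["model"],
        "token_usage": generator["tokens"],
    }


if __name__ == "__main__":
    print(to_public_response(INTERNAL_RESULT))
```

Because consumers only ever see the four public fields, stages can be added, swapped, or removed internally without changing the API contract.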