
Production ML Infrastructure Patterns

FastAPI, Docker, model serving, and monitoring patterns that actually scale.

Most ML infrastructure advice assumes you are Google or have Google-scale problems. You do not. You have a model that needs to serve predictions reliably, a budget that matters, and a team that cannot afford to maintain Kubernetes clusters and custom orchestration layers.

This is about the infrastructure patterns that work for the rest of us. The patterns that let you go from trained model to production endpoint in days, not months. The patterns that scale to thousands of requests per second before you need to think about distributed systems.

I have built ML systems for compliance screening, enterprise knowledge management, and industrial inspection. The infrastructure is never the interesting part, but it is always the part that determines whether the model actually delivers value. Here is what works.

Why FastAPI Won

Flask was the default for ML serving for years. Then FastAPI arrived and within two years became the obvious choice. Not because it is faster, though it is. Not because it has automatic OpenAPI documentation, though that matters. FastAPI won because it makes the correct choice the easy choice. Type hints are enforced at runtime via Pydantic. Request validation happens automatically. You cannot accidentally return the wrong schema without noticing immediately. This eliminates an entire class of bugs.

Consider a basic prediction endpoint. With Flask you write validation logic by hand, parse JSON manually, handle errors yourself, and hope your documentation stays synchronized with your code. With FastAPI you define a Pydantic model and the framework does the rest.

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    features: list[float]
    model_version: str = "latest"

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float
    model_version: str

@app.post("/predict", response_model=PredictionResponse)
async def predict(request: PredictionRequest):
    # Load model, make prediction
    # Validation, serialization, documentation: automatic
    return PredictionResponse(
        prediction=0.87,
        confidence=0.92,
        model_version="v1.2.0"
    )

This is not about saving lines of code. It is about reducing the surface area for errors. The request schema is the documentation is the validation logic is the type hint. Change one and you change all of them. This is how infrastructure should work.

FastAPI also handles async properly, which matters when your model serving includes external API calls, database lookups, or preprocessing steps that can run concurrently. You get proper async/await support without fighting the framework.
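
When preprocessing involves independent I/O, that looks roughly like the sketch below, assuming a hypothetical external feature service; the URLs and the run_model helper are placeholders, not a real API.

import asyncio

import httpx
from fastapi import FastAPI

app = FastAPI()

# Hypothetical internal services; substitute your own.
FEATURE_SERVICE_URL = "http://features.internal/lookup"
METADATA_SERVICE_URL = "http://metadata.internal/lookup"

def run_model(features: dict, metadata: dict) -> float:
    # Placeholder for the actual model call.
    return 0.5

@app.post("/predict/{item_id}")
async def predict(item_id: str):
    # The two lookups are independent, so run them concurrently.
    async with httpx.AsyncClient(timeout=2.0) as client:
        features_resp, metadata_resp = await asyncio.gather(
            client.get(FEATURE_SERVICE_URL, params={"id": item_id}),
            client.get(METADATA_SERVICE_URL, params={"id": item_id}),
        )
    # Inference itself is CPU-bound; run it in a worker thread so it
    # does not block the event loop.
    prediction = await asyncio.to_thread(
        run_model, features_resp.json(), metadata_resp.json()
    )
    return {"prediction": prediction}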

Docker Patterns for ML

Docker for ML is different from Docker for web applications. Your images are large because models are large. Your builds are slow because installing scientific Python packages is slow. You need GPU support, which means NVIDIA runtime configuration. And you need reproducibility, which means pinning everything.

Multi-stage builds are not optional. Your final image should contain the model and the serving code, not the training code, not the dataset, not the Jupyter notebooks. A typical pattern:

FROM python:3.11-slim as builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

FROM python:3.11-slim

WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY ./app ./app
COPY ./models ./models

ENV PATH=/root/.local/bin:$PATH

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

For GPU support you need the NVIDIA base images and proper runtime configuration. The Dockerfile changes, but more importantly you need to ensure your deployment environment has nvidia-docker (now the NVIDIA Container Toolkit) installed and configured. This is straightforward on a dedicated ML server, more complex in Kubernetes, and the reason many teams stick with CPU inference longer than optimal.

Layer caching matters. Put your dependencies before your code. Put your model weights in a separate layer from your serving code if they change at different rates. Use .dockerignore aggressively. A 3GB image that rebuilds in 30 seconds is better than a 1GB image that rebuilds in 5 minutes.

Model Versioning and Artifact Storage

You will have multiple model versions in production simultaneously. Not because you want to, but because rolling deployments exist, A/B tests exist, and rollback capability is not optional. I have seen teams avoid proper versioning and then face a production incident where they cannot determine which model version is serving predictions. The postmortem is never fun.

The simplest pattern that works: store models in S3 or equivalent object storage, version them with semantic versioning, and include the version in your prediction response. Your serving code loads the model on startup based on an environment variable or configuration file.

MODEL_VERSION=v1.2.0
MODEL_BUCKET=s3://ml-models/production
MODEL_PATH=${MODEL_BUCKET}/${MODEL_VERSION}/model.pkl
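
The loading side is a few lines. A minimal sketch, assuming boto3 and a pickled model; note that boto3 takes the bucket name and object key separately, so the s3:// URL above gets split, and the exact key layout here is illustrative.

import os
import pickle

import boto3

MODEL_VERSION = os.environ.get("MODEL_VERSION", "latest")
MODEL_BUCKET = os.environ.get("MODEL_BUCKET", "ml-models")  # bucket name only

def load_model(version: str = MODEL_VERSION):
    """Download and deserialize a specific model version from object storage."""
    s3 = boto3.client("s3")
    local_path = f"/tmp/model-{version}.pkl"
    # Key layout mirrors the configuration above: production/<version>/model.pkl
    s3.download_file(MODEL_BUCKET, f"production/{version}/model.pkl", local_path)
    with open(local_path, "rb") as f:
        return pickle.load(f)

model = load_model()  # loaded once at startup, reused for every request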

MLflow and similar tools add experiment tracking and model registry capabilities. Use them if you need them. But the core requirement is simpler: every prediction must be traceable to a specific model version, and you must be able to load any historical version on demand.

For artifact storage, S3 is the default choice because it works and pricing is predictable. For smaller models or lower latency requirements, baking the model into the Docker image works fine. The tradeoff is deployment speed versus image size. A 500MB model in the image means slower deployments but no startup-time download.

Monitoring: Drift Detection and Performance Tracking

Production ML monitoring has two distinct concerns: is the service running, and is the model still accurate. The first is standard infrastructure monitoring. The second requires ML-specific tooling.

For service health, track the usual metrics: request latency, error rate, throughput. Use Prometheus and Grafana or equivalent. Set alerts for p95 latency above your target, error rates above baseline, and memory usage trends.
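
The service-health side needs no ML-specific tooling. A sketch of request metrics using prometheus_client middleware in FastAPI; the metric names are arbitrary.

import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()

REQUEST_LATENCY = Histogram(
    "request_latency_seconds", "Request latency in seconds", ["path"]
)
REQUEST_ERRORS = Counter(
    "request_errors_total", "Count of 5xx responses", ["path"]
)

@app.middleware("http")
async def track_metrics(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    REQUEST_LATENCY.labels(path=request.url.path).observe(time.perf_counter() - start)
    if response.status_code >= 500:
        REQUEST_ERRORS.labels(path=request.url.path).inc()
    return response

# Expose the metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())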

For model health, track input distribution drift and prediction distribution drift. The former tells you if your input data has changed in ways the model has not seen. The latter tells you if your model behavior has changed. Both require storing samples of production inputs and predictions.

A basic drift detection pattern: log a random sample of inputs and predictions to a database or data warehouse. Run a daily job that compares the distribution of recent inputs to the training distribution, using a statistical test like the Kolmogorov-Smirnov (KS) test or a metric like the Population Stability Index (PSI). Alert if drift exceeds thresholds.

from scipy.stats import ks_2samp

def detect_drift(training_data, production_data, threshold=0.05):
    """
    Detect distribution drift using Kolmogorov-Smirnov test.
    Returns True if significant drift detected.
    """
    statistic, pvalue = ks_2samp(training_data, production_data)
    return pvalue < threshold
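
PSI takes a bit more code because it bins the data first. One common formulation, binning on the training distribution, with a small epsilon so empty buckets do not blow up the log; the usual 0.1 and 0.25 alert thresholds are conventions, not hard rules.

import numpy as np

def population_stability_index(expected, actual, bins=10, eps=1e-6):
    """Compare two 1-D samples, binned on the expected (training) distribution."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Production values outside the training range fall out of the bins;
    # widen the outer edges if that matters for your data.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))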

For prediction quality, you need ground truth labels, which usually arrive delayed. Financial models get actuals when transactions settle. Fraud models get labels when investigations complete. Recommendation models get labels when users click or purchase.

The monitoring pattern: join predictions with delayed labels, compute accuracy metrics on a rolling window, alert on degradation. This is simpler than it sounds if you design your prediction logging schema to include a join key and timestamp.
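
A rough sketch of that join with pandas, assuming a predictions table keyed by prediction_id with a timestamp and predicted label, and a labels table with the same key; the column names are illustrative.

import pandas as pd

def rolling_accuracy(predictions: pd.DataFrame, labels: pd.DataFrame, window_days: int = 7) -> pd.Series:
    """Join predictions with delayed labels and compute accuracy over a rolling window."""
    joined = predictions.merge(labels, on="prediction_id", how="inner")
    joined["timestamp"] = pd.to_datetime(joined["timestamp"])
    joined["correct"] = (joined["predicted"] == joined["actual"]).astype(float)
    # Time-based rolling windows need a sorted DatetimeIndex.
    joined = joined.sort_values("timestamp").set_index("timestamp")
    # Alert when this drops below your baseline.
    return joined["correct"].rolling(f"{window_days}D").mean()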

Caching Strategies for Inference

Model inference is often CPU or GPU bound. Caching is how you serve 10x more requests on the same hardware. The question is what to cache and for how long.

For deterministic models with repeatable inputs, cache the predictions. A product categorization model will see the same product descriptions repeatedly. An NER model will see the same entity patterns. Use Redis or Memcached with a TTL based on model update frequency.

For embeddings or feature extraction, cache the intermediate representations. If your pipeline is embed-then-classify, cache the embeddings. The cache hit rate depends on input repetition, but for many applications 30-50% hit rates are common and eliminate the most expensive part of inference.

import hashlib
import json

import redis

cache = redis.Redis(host='localhost', port=6379)

def cached_predict(input_text: str, model):
    cache_key = hashlib.sha256(input_text.encode()).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    prediction = model.predict(input_text)
    cache.setex(cache_key, 3600, json.dumps(prediction))
    return prediction

The cache invalidation strategy depends on your model update cadence. If you deploy new models weekly, set TTL to match. If you deploy multiple times per day, use version-aware cache keys that include model version. If model updates are rare, TTL can be measured in days.
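
Version-aware keys are a one-line change to the caching example above; something like this hypothetical helper, where the version string comes from your deployment configuration.

import hashlib

def versioned_cache_key(input_text: str, model_version: str) -> str:
    # A new model version produces new keys, so stale predictions simply
    # stop being hit rather than needing explicit invalidation.
    digest = hashlib.sha256(input_text.encode()).hexdigest()
    return f"{model_version}:{digest}"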

Batch vs Real-Time Serving

Most ML workloads do not need real-time inference. They need results within seconds, minutes, or hours. The architecture difference between 10ms latency and 10-second latency is the difference between complex infrastructure and a Python script with a queue. Real-time here means low-latency synchronous, not streaming. Batch means async with higher latency tolerance. The terminology is imprecise but the distinction matters for architecture.

Real-time serving requires: a persistent service, load balancing, health checks, horizontal scaling, and careful optimization of model latency. You use FastAPI, Docker, and deploy behind a load balancer. Latency targets are p95 under 100ms for many applications.
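
The health checks deserve one detail: the load balancer should not route traffic to an instance whose model is still loading. A minimal sketch with separate liveness and readiness endpoints; the module-level model variable is illustrative.

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
model = None  # set by your startup code once the model is in memory

@app.get("/health")
async def health():
    # Liveness: the process is up.
    return {"status": "ok"}

@app.get("/ready")
async def ready():
    # Readiness: only accept traffic once the model is loaded.
    if model is None:
        return JSONResponse(status_code=503, content={"status": "loading"})
    return {"status": "ready"}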

Batch serving requires: a queue, worker processes, retry logic, and output storage. You use Celery or RQ, a message broker like Redis or RabbitMQ, and workers that pull tasks and write results. Latency targets are measured in seconds or minutes.
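
A sketch of that worker with Celery and Redis; the broker URLs, task name, and the three helper functions are placeholders for your own data access and model code.

from celery import Celery

app = Celery("scoring", broker="redis://localhost:6379/0", backend="redis://localhost:6379/1")

def load_records(record_ids):
    # Placeholder: fetch input rows from your datastore.
    return [{"id": rid} for rid in record_ids]

def run_model(records):
    # Placeholder: call the shared model artifact.
    return [0.0 for _ in records]

def write_results(record_ids, predictions):
    # Placeholder: persist predictions to your datastore.
    pass

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def score_batch(self, record_ids):
    try:
        records = load_records(record_ids)
        predictions = run_model(records)
        write_results(record_ids, predictions)
    except Exception as exc:
        # Retry the whole batch after a delay instead of dropping it.
        raise self.retry(exc=exc)

Producers enqueue work with score_batch.delay([...]), and scaling is a matter of starting more worker processes against the same broker.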

The batch pattern is simpler and cheaper. One worker can process thousands of predictions per hour. Scaling is adding more workers. Deployment is restarting workers. Rollback is deploying old code. No load balancer, no health checks, no coordination required.

Use real-time serving when users are waiting for the result or when the use case requires low latency. Fraud detection during checkout, content moderation before posting, real-time personalization. Use batch serving for everything else. Data enrichment, reporting, scheduled scoring, backfill jobs.

Many systems need both. Real-time API for user-facing predictions, batch jobs for bulk scoring and model evaluation. The infrastructure can share model artifacts and code while serving different latency requirements.

When Kubernetes is Overkill

Kubernetes solves orchestration problems you do not have until you have many services, complex dependencies, and scaling requirements that exceed what a few servers can handle. Most ML systems do not meet this bar for years.

A single EC2 instance or dedicated server with Docker and systemd can serve thousands of requests per second and handle failover via health checks and auto-restart. You lose some of Kubernetes' features but gain operational simplicity. No cluster management, no etcd, no kubectl debugging at 2am.

The threshold where Kubernetes makes sense: multiple ML services with different scaling requirements, frequent deployments that need zero-downtime rollout, or organizational requirements for platform consistency. Below that threshold, simpler is better.

Alternative patterns: Docker Compose for multi-container coordination, systemd for process management, nginx or HAProxy for load balancing, cloud provider load balancers for high availability. This stack handles most ML serving workloads and costs a fraction of the operational overhead.

Simple Patterns That Scale

The infrastructure patterns that work in production are boring. They use well-understood tools, avoid clever abstractions, and prioritize debuggability over feature richness. FastAPI for serving, Docker for packaging, object storage for artifacts, Redis for caching, Postgres for metadata.

Start with the simplest architecture that meets requirements. One service, one server, no orchestration. Add complexity only when you measure the need. Most ML systems never need more than this foundation plus monitoring and proper deployment automation.

The goal is not to build infrastructure. The goal is to deploy models that deliver value with minimal operational overhead. Every component you add is surface area for failure and complexity you must maintain. Choose boring technology, optimize when you have metrics that justify it, and focus your effort on the model and the problem domain.

Production ML infrastructure is not about using the latest tools or the most sophisticated architecture. It is about reliability, debuggability, and getting predictions to users quickly. The patterns that work are the patterns that you can understand, operate, and debug when things break at 3am.