How Uber Runs 30 Million ML Predictions Per Second on Kubernetes

The architecture behind Michelangelo and what every engineering team can learn from it

Published by Insigh8s · Tech Stories · KubeCon EU 2026 Amsterdam

At KubeCon EU 2026 in Amsterdam, I heard Uber's numbers: 30 million ML predictions per second at peak, across roughly 1,000 serving nodes, at under 10ms P95 latency. And I thought: how does Kubernetes actually hold that together?

Because here's the thing. Kubernetes was not designed for this. It was built to orchestrate stateless web services. ML inference at Uber's scale is stateful, GPU-bound, latency-sensitive, and spans multiple clusters across multiple clouds and data centres. Getting Kubernetes to work at this scale required Uber to rethink almost every default assumption the platform makes.

This is the story of how they did it. Not the marketing version. The architecture version.

A Brief History: From Chaos to Michelangelo

Before 2016, Uber's ML story was familiar chaos. Data scientists built models on laptops using R, scikit-learn, or whatever suited them. When a model needed to go to production, a separate engineering team would build a bespoke serving container for it, from scratch, every time. No shared pipelines, no feature reuse, no reproducibility.

The result? ML impact was bottlenecked by whatever a handful of engineers could cobble together. Uber was scaling to tens of millions of trips per day and their ML infrastructure looked like a patchwork quilt.

In 2016, they launched Michelangelo, an internal ML-as-a-Service platform. The goal was simple to state and hard to execute: democratize ML across Uber so that any team could build, deploy, and operate models at full production scale.

Eight years later, that platform runs 100% of Uber's mission-critical ML workloads. Here's what its architecture looks like today, and more importantly, why the decisions were made the way they were.

The Architecture: Three Planes, One Platform

Michelangelo 2.0 is organized around three distinct planes. This separation is not cosmetic. Each plane has fundamentally different scaling characteristics, failure modes, and infrastructure requirements. Collapsing them together would make each one worse.

┌─────────────────────────────────────────────────────┐
│                   CONTROL PLANE                     │
│         Kubernetes Operator pattern (CRDs)          │
│   Manages lifecycle: models, jobs, deployments      │
└──────────────┬──────────────────────┬───────────────┘
               │                      │
               ▼                      ▼
┌──────────────────────┐  ┌───────────────────────────┐
│   OFFLINE DATA PLANE │  │    ONLINE DATA PLANE      │
│                      │  │                           │
│  Ray + Spark on K8s  │  │  NVIDIA Triton serving    │
│  20K training jobs/mo│  │  30M predictions/second   │
│  Batch inference     │  │  Real-time feature serving│
│  Model evaluation    │  │  Near-real-time features  │
└──────────────────────┘  └───────────────────────────┘
               │                      │
               └──────────┬───────────┘
                          ▼
          ┌───────────────────────────────┐
          │   MICHELANGELO JOB CONTROLLER │
          │   (Federation layer)          │
          │   Schedules across N clusters │
          └───────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
      K8s Cluster A   K8s Cluster B   K8s Cluster C
      (AZ-1, on-prem) (AZ-2, OCI)    (AZ-3, GCP)

Let's go through each layer in detail.

Layer 1: The Control Plane, Kubernetes Operators All the Way Down

The control plane is the brain of Michelangelo. It manages the lifecycle of every ML entity in the system: training jobs, model versions, deployments, serving configurations, and monitoring policies.

The key design decision here was to model the entire ML lifecycle as Kubernetes Custom Resources using the Operator pattern. Every ML concept, whether a training job, a model deployment, or a feature pipeline, becomes a CRD that the platform manages declaratively.

This is not a trivial decision. Here's what it actually means in practice.

What the Operator Pattern Gives You

In standard Kubernetes, you describe the desired state of your application in a YAML manifest. Kubernetes continuously reconciles actual state to match desired state. The Operator pattern extends this to arbitrary domain-specific objects.

For Michelangelo, that means an ML engineer can write something conceptually like this:

apiVersion: michelangelo.uber.com/v1
kind: ModelDeployment
metadata:
  name: deepeta-v3
  namespace: maps-team
spec:
  model:
    name: deepeta
    version: "3.2.1"
    framework: pytorch
  serving:
    replicas: 50
    resources:
      gpu: "1"
      memory: "32Gi"
    autoscaling:
      minReplicas: 20
      maxReplicas: 200
      targetLatencyP95Ms: 10
  featureStore:
    paletteSets:
      - trip-context-features
      - realtime-traffic-features
  trafficPolicy:
    canaryWeight: 5
    stableWeight: 95

And the Michelangelo Operator handles everything that happens next: pulling the model artifact, provisioning GPU nodes, routing traffic, wiring up the feature store, configuring autoscaling, and setting up monitoring. The engineer didn't write a single deployment script.
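
To make the reconciliation idea concrete, here is a minimal sketch of a single reconcile step. It is not Uber's operator: the names DesiredState, ObservedState, and reconcile are hypothetical, the state is reduced to two fields, and actions are returned as strings instead of being applied to a cluster. Production operators are commonly written in Go and do far more than this.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DesiredState:
    """Subset of a ModelDeployment spec (field names are illustrative)."""
    model_version: str
    replicas: int


@dataclass(frozen=True)
class ObservedState:
    """What the cluster is actually running right now."""
    model_version: str
    replicas: int


def reconcile(desired: DesiredState, observed: ObservedState) -> list[str]:
    """Return the actions needed to move observed state toward desired state.

    An empty list means the two already match, so repeated calls are harmless.
    """
    actions: list[str] = []
    if observed.model_version != desired.model_version:
        actions.append(f"roll out model version {desired.model_version}")
    if observed.replicas < desired.replicas:
        actions.append(f"scale up by {desired.replicas - observed.replicas}")
    elif observed.replicas > desired.replicas:
        actions.append(f"scale down by {observed.replicas - desired.replicas}")
    return actions


if __name__ == "__main__":
    desired = DesiredState(model_version="3.2.1", replicas=50)
    observed = ObservedState(model_version="3.2.0", replicas=40)
    for action in reconcile(desired, observed):
        print(action)
```

The property worth noticing is idempotence: once observed state matches desired state, the step returns nothing to do, which is what lets an operator run the same loop forever without side effects.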

Why This Matters for Architecture Teams

The Operator pattern gives you operational knowledge encoded in software rather than in runbooks. The kubectl mental model your engineers already know maps directly onto ML lifecycle management. You get audit trails, GitOps compatibility, RBAC via standard Kubernetes primitives, and the ability to extend the platform without rewriting the core.

The alternative is bespoke orchestration scripts, Jenkins jobs, and custom APIs. That is what Uber had before. The Operator pattern is why they could go from dozens of ML use cases to 5,300 production models without proportionally scaling the platform team.

Comparison with Standard Kubernetes Patterns

Concern              | Standard Kubernetes      | Michelangelo Approach
---------------------|--------------------------|------------------------------
Workload definition  | Deployment / StatefulSet | Custom ModelDeployment CRD
Lifecycle management | Manual kubectl / Helm    | Operator reconciliation loop
Scaling              | HPA on CPU/memory        | Custom metrics (latency, QPS)
Feature injection    | ConfigMap / Secret       | Palette feature store sidecar
Multi-cluster        | Manual federation        | Job Controller (see below)

Layer 2: The Offline Data Plane, Training at Scale with Ray on Kubernetes

The offline data plane handles everything that doesn't need to be real-time: model training, evaluation, batch inference, and data pipeline execution. This is where Uber's 20,000 monthly training jobs run.

Until 2023, this plane ran on Apache Mesos + Peloton, Uber's homegrown cluster manager. When they decided to move to Kubernetes, they treated it not as a like-for-like migration but as a full rethink of how training infrastructure should work.

Why They Moved from Mesos to Kubernetes

The old system had three critical problems.

Leaky abstraction. ML engineers had to be aware of infrastructure details: which region, which zone, which cluster had available GPU SKUs. That is not their job and the cognitive overhead compounded at scale.

Tight coupling. The serving infrastructure was tightly coupled to the underlying compute, making migrations painful and cloud portability nearly impossible.

Ecosystem mismatch. Both Ray and Spark, the two frameworks Uber relies on heavily, had developed native Kubernetes support. Staying on Mesos meant maintaining custom integrations indefinitely.

Kubernetes solved all three. But it introduced a new problem.

The Multi-Cluster Training Problem

Uber runs 5,000+ GPUs. A single Kubernetes cluster cannot hold all of them, both because of node count limits and because GPUs are distributed across availability zones and cloud providers for resilience.

This means a training job might need to run on Cluster A (on-prem, AZ-1) or Cluster B (OCI, AZ-2) or Cluster C (GCP, AZ-3) depending on which cluster has the right GPU SKU available right now.

The naive solution is to expose all of this to ML engineers. That's the old world and it doesn't scale.

Uber's solution is the Michelangelo Job Controller, a federation layer that sits above all the clusters. An engineer submits a job spec that describes what they need, not where it should run:

apiVersion: michelangelo.uber.com/v1
kind: TrainingJob
metadata:
  name: deepeta-training-run-472
spec:
  framework: ray
  resources:
    instanceType: a100-80gb
    gpuCount: 64
    memoryGb: 512
  code:
    image: michelangelo/deepeta-trainer:3.2.0
  data:
    paletteFeatureSets:
      - trip-features-v4
    hiveTable: trips.training_data_2024
  output:
    modelRegistry: gallery
    experimentId: deepeta-q1-2025

The Job Controller determines which cluster to run on, handles scheduling across the federation, and abstracts all infrastructure details away from the engineer. Engineers think about workloads, not clusters.
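
How the Job Controller actually chooses a cluster is not described in detail, so the following is only a sketch of one plausible placement rule: filter clusters by GPU SKU and free capacity, then pick the tightest fit. The Cluster type, the pick_cluster function, and the capacity numbers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    name: str
    free_gpus: dict[str, int]  # GPU SKU -> currently free GPU count


def pick_cluster(
    clusters: list[Cluster], instance_type: str, gpu_count: int
) -> Optional[Cluster]:
    """Best-fit placement: the cluster with the least spare capacity that still fits.

    Returns None when no cluster can host the job right now.
    """
    candidates = [
        c for c in clusters if c.free_gpus.get(instance_type, 0) >= gpu_count
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.free_gpus[instance_type])


if __name__ == "__main__":
    clusters = [
        Cluster("cluster-a", {"a100-80gb": 32}),
        Cluster("cluster-b", {"a100-80gb": 128}),
        Cluster("cluster-c", {"a100-80gb": 72, "h100": 16}),
    ]
    chosen = pick_cluster(clusters, instance_type="a100-80gb", gpu_count=64)
    print(chosen.name if chosen else "no capacity, queue the job")
```

Best-fit is one choice among several. It keeps large blocks of free GPUs intact for big jobs, at the cost of ignoring factors a real federation layer would weigh, such as data locality, quota, and cost.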

The training stack itself uses Ray + PyTorch + DeepSpeed + Hugging Face, all running natively on Kubernetes via Ray's Kubernetes operator (KubeRay). This gives them dynamic worker scaling, fault tolerance, and the ability to leverage spot/preemptible instances for cost optimization on non-critical training runs.

Layer 3: The Online Data Plane, Serving 30M Predictions Per Second

This is the layer that produces the headline number. And it's where the most interesting engineering decisions were made.

Replacing the Serving Engine with NVIDIA Triton

Michelangelo 2.0 replaced its previous custom serving engine with NVIDIA Triton Inference Server. Triton is open source, supports TensorFlow, PyTorch, XGBoost, and TensorRT from a single runtime, and integrates natively with Kubernetes.

The key advantage for Uber was a single serving runtime across all model types. Previously, different model frameworks required different serving containers. That meant maintaining multiple code paths, multiple Docker images, and multiple operational runbooks. Triton collapses this into a single abstraction.

The serving architecture looks like this:

Incoming prediction request (e.g. from rides-matching service)
        │
        ▼
   Load Balancer / Service Mesh (Envoy)
        │
        ▼
   Triton Serving Pod (GPU node)
   ┌─────────────────────────────────┐
   │  Model: deepeta v3.2.1          │
   │  Framework: PyTorch             │
   │                                 │
   │  ┌───────────────────────────┐  │
   │  │  Feature enrichment       │  │
   │  │  (Palette sidecar)        │  │
   │  └───────────────────────────┘  │
   │            │                    │
   │            ▼                    │
   │  ┌───────────────────────────┐  │
   │  │  Model inference (GPU)    │  │
   │  │  P95 latency target: 10ms │  │
   │  └───────────────────────────┘  │
   └─────────────────────────────────┘
        │
        ▼
   Prediction response

Elastic CPU/GPU Sharing: The Efficiency Unlock

One of the most operationally clever decisions Uber made was elastic resource sharing between training and serving.

Serving pods are optimized for latency. They need consistent, reserved GPU capacity. But at 3am, when trip demand is low, those GPUs sit mostly idle. Training jobs are throughput-oriented and can tolerate some preemption.

Michelangelo implements a reactive scheduling policy: idle serving capacity gets opportunistically allocated to training workloads during off-peak hours, then reclaimed when serving demand rises. This is not Kubernetes bin-packing out of the box. It requires custom scheduling logic in the Job Controller that is aware of serving SLAs.
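
The lend-and-reclaim logic can be sketched with two small functions. This is an illustration under stated assumptions, not Uber's policy: the function names, the fixed headroom reserved for serving, and the numbers in the example are all hypothetical.

```python
def lendable_gpus(total_gpus: int, serving_demand: int, headroom: int) -> int:
    """GPUs that can be lent to training while keeping `headroom` spare for serving."""
    return max(0, total_gpus - serving_demand - headroom)


def gpus_to_reclaim(
    lent: int, total_gpus: int, serving_demand: int, headroom: int
) -> int:
    """GPUs that must be taken back from training after serving demand changes."""
    return max(0, lent - lendable_gpus(total_gpus, serving_demand, headroom))


if __name__ == "__main__":
    total, headroom = 100, 10

    # Off-peak: serving needs 30 GPUs, so training can borrow the rest minus headroom.
    lent = lendable_gpus(total, serving_demand=30, headroom=headroom)
    print(f"off-peak: lend {lent} GPUs to training")

    # Demand rises to 70: some of the borrowed GPUs have to be preempted.
    reclaim = gpus_to_reclaim(lent, total, serving_demand=70, headroom=headroom)
    print(f"peak approaching: reclaim {reclaim} GPUs from training")
```

The headroom term is where awareness of the serving SLA enters: it is capacity that is never lent, so a sudden rise in demand can be absorbed while preemption of training jobs is still in progress.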

The result is significantly higher overall GPU utilization. GPUs are expensive. Letting them sit idle at 3am is throwing money away.

Autoscaling: Not Just HPA

Out of the box, Kubernetes HPA scales on CPU and memory. For ML serving, those metrics are nearly meaningless. A GPU can be at 30% utilization while serving at full capacity, and scaling on CPU would give you the wrong answer entirely.

Uber scales their serving pods on inference-specific metrics: requests per second, P95 latency, and GPU queue depth. These are exposed via Prometheus and consumed by a custom autoscaler that understands the difference between "GPU is underutilized" and "we're approaching saturation."

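A simplified, HPA-style version of this configuration might look like the following. The metric names are illustrative: they assume a custom metrics adapter (for example, Prometheus Adapter) that turns Triton's raw Prometheus counters into per-pod latency and queue-time values.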

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepeta-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepeta-triton
  minReplicas: 20
  maxReplicas: 200
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_inference_request_duration_p95_ms
      target:
        type: AverageValue
        averageValue: "8"   # target 8ms, alert at 10ms
  - type: Pods
    pods:
      metric:
        name: triton_inference_queue_duration_us
      target:
        type: AverageValue
        averageValue: "1000"  # queue time under 1ms
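
For intuition about what an HPA does with a target like this, the documented scaling rule is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (10% by default) and clamped to the min and max replica counts. The sketch below implements only that core rule. The function name is hypothetical, and the real controller also handles stabilization windows, unready pods, and multiple metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Core HPA rule: scale replicas by the ratio of current to target metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target, do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 50 pods averaging 12ms P95 against the 8ms target from the manifest above.
    print(desired_replicas(50, current_metric=12, target_metric=8,
                           min_replicas=20, max_replicas=200))
```

One caveat: the rule assumes the metric falls roughly in proportion to added replicas. That holds reasonably well for queue time under load and much less well for latency dominated by model execution, which is one reason to scale on more than one signal.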

The Feature Store: Solving Training-Serving Skew

No discussion of Michelangelo's architecture is complete without Palette, Uber's centralized feature store. This is the piece that makes the whole system actually produce accurate predictions.

Training-serving skew is one of the most common silent killers of ML model accuracy. Your model trains on features computed one way (in batch, offline), but at serving time the same features are computed differently (in real-time, online). The predictions degrade subtly and the root cause is nearly impossible to debug.

Palette solves this with a dual-store architecture:

Feature Definition (written once, in DSL)
        │
        ├──────────────────────────────────┐
        ▼                                  ▼
  OFFLINE STORE                      ONLINE STORE
  (HDFS / Hive)                      (Cassandra)
  Batch computation                  Real-time serving
  for training data                  <5ms reads
        │                                  │
        ▼                                  ▼
  Model training                    Model serving
  (same features)                   (same features)

The same feature definition code runs in both contexts. Training and serving are guaranteed to compute features identically. Uber has over 20,000 features in Palette, shared across teams. When the rides team computes a "driver average rating in last 30 days" feature, the Eats team can reuse it directly rather than reimplementing it.

On Kubernetes, Palette features are injected into serving pods as a sidecar that handles real-time feature lookups from Cassandra. The serving pod doesn't need to know where features come from. It just calls the sidecar.
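
The "define once, compute in both places" idea can be shown in miniature. This sketch does not use Palette's DSL or API. It uses the 30-day driver rating example from above, with hypothetical function names and in-memory data standing in for the offline and online stores.

```python
from __future__ import annotations

from statistics import mean
from typing import Optional

Rating = tuple[int, float]  # (day number, rating)


def driver_avg_rating_30d(ratings: list[Rating], now_day: int) -> Optional[float]:
    """Single feature definition: mean rating over the last 30 days, inclusive of today."""
    window = [r for day, r in ratings if now_day - 30 < day <= now_day]
    return mean(window) if window else None


def build_training_rows(
    history: dict[str, list[Rating]], as_of_day: int
) -> dict[str, Optional[float]]:
    """Offline path: compute the feature in batch for every driver."""
    return {d: driver_avg_rating_30d(r, as_of_day) for d, r in history.items()}


def serve_feature(
    store: dict[str, list[Rating]], driver_id: str, now_day: int
) -> Optional[float]:
    """Online path: compute the feature for one driver at request time."""
    return driver_avg_rating_30d(store.get(driver_id, []), now_day)


if __name__ == "__main__":
    data = {"d1": [(1, 4.0), (40, 5.0), (50, 4.0)]}
    offline = build_training_rows(data, as_of_day=60)["d1"]
    online = serve_feature(data, "d1", now_day=60)
    print(offline, online, offline == online)
```

Because both paths call the same function, a change to the window or the aggregation cannot reach training without also reaching serving. Skew typically creeps in when the two paths are implemented separately, often in different languages.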

What This Means for Your Organisation

You are almost certainly not operating at Uber's scale. But the architectural patterns here apply at 10 nodes as much as at 1,000. Here's how to think about each layer for your own platform.

1. Adopt the Operator pattern early, not retroactively

If you are building an internal ML platform or even just standardizing how teams deploy models, the Operator pattern is the right foundation. The effort to implement it early is far less than the effort to migrate to it later. Start with a ModelDeployment CRD and a simple reconciler. Even a basic one eliminates entire categories of toil.

Tools to evaluate: Kubeflow and KServe implement Operator-based ML lifecycle management, and BentoML offers a Kubernetes operator through its Yatai project. Any of them can give you this foundation without building from scratch.

2. Separate your control plane from your data planes

This is a general distributed systems principle that Michelangelo applies well. Things with different scaling characteristics and failure modes should be separated. Your lifecycle management system (control plane) should not go down when your serving infrastructure is under load (data plane). Keep them independent.

3. Address multi-cluster before you need it

Most teams add multi-cluster support reactively, when they're already in pain. The federation layer concept, a scheduler that sits above your clusters and abstracts away their heterogeneity, is worth designing for even if you start with a single cluster. It will save you from having to re-architect your scheduling logic later.

Tools worth evaluating: Kueue (CNCF project, now adopted by Google for batch ML workloads), Volcano, and YuniKorn as starting points for multi-cluster-aware scheduling.

4. Use Triton if you have multiple model frameworks

If your organisation runs more than one model type (XGBoost for tabular data plus PyTorch for deep learning is extremely common), a unified serving runtime eliminates massive operational overhead. Triton is open source, Kubernetes-native, and battle-tested at Uber scale.

5. Solve training-serving skew before it bites you

If you don't have a feature store, you have training-serving skew. You just don't know it yet. The dual-store pattern (batch offline store for training, low-latency online store for serving, same feature definitions for both) is the solution. Open-source options include Feast (an LF AI & Data project) and Hopsworks, with Tecton on the commercial side.

6. Scale on inference metrics, not CPU

If you're running GPU workloads on Kubernetes and scaling on CPU, you are scaling on the wrong signal. Expose Triton's Prometheus metrics and build your HPA around request latency and queue depth instead.

The Honest Takeaways

A few things worth noting that don't make it into keynote slides.

They built a lot of this themselves. The Michelangelo Job Controller (federation layer) is not an open-source project you can deploy. It's proprietary engineering built over years. If you want this capability, you're either building it, buying it (CAST.ai, Run:ai), or starting with Kueue and extending it.

The migration from Mesos to Kubernetes was years of work. Uber's Kubernetes story is not "we switched and everything was better." It was a multi-year migration with custom integrations for Ray, Spark, and their internal tooling. The payoff was real, but so was the cost.

The 30M/s number is a peak stat. Average throughput is lower. Peak is what your architecture has to be designed for, but don't mistake peak for typical.
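
As a back-of-the-envelope check on what the peak figure implies per node, the numbers quoted at the top of this post can be combined with Little's law (items in flight = arrival rate × time in system). The calculation assumes load is spread evenly across nodes and uses the 10ms P95 as a pessimistic stand-in for average latency. Neither assumption comes from Uber, and predictions may be batched, so predictions per second is not the same as requests per second.

```python
def per_node_rate(total_per_second: float, nodes: int) -> float:
    """Average per-node throughput, assuming perfectly even load distribution."""
    return total_per_second / nodes


def in_flight(rate_per_second: float, latency_ms: float) -> float:
    """Little's law: average items in flight = arrival rate x average time in system."""
    return rate_per_second * latency_ms / 1000


if __name__ == "__main__":
    rate = per_node_rate(30_000_000, nodes=1_000)
    print(f"{rate:,.0f} predictions/s per node at peak")
    print(f"up to ~{in_flight(rate, latency_ms=10):,.0f} predictions in flight per node")
```

These are order-of-magnitude figures for capacity planning intuition, not measurements.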

Kubernetes was not sufficient alone. Every layer described in this post represents work Uber did above Kubernetes: the federation layer, the custom autoscaler, the Operator, the Palette sidecar. Kubernetes is the foundation, not the ceiling. Your ML platform will require the same philosophy.

Further Reading

All the architecture details in this post are grounded in Uber's own engineering publications. If you want to go deeper:

Insigh8s (i8s) is an open-source Kubernetes reference architecture project, an opinionated CNCF-only blueprint for building production-grade Kubernetes platforms. Tech Stories is a recurring series covering how leading engineering organisations solve real infrastructure problems at scale.
