<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Insigh8s]]></title><description><![CDATA[Insigh8s]]></description><link>https://blog.insigh8s.io</link><image><url>https://cdn.hashnode.com/uploads/logos/668ed26b53ecf1ff0c31457d/35fe8d44-9c9b-4867-a64e-31dcd3817b93.png</url><title>Insigh8s</title><link>https://blog.insigh8s.io</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 15:59:10 GMT</lastBuildDate><atom:link href="https://blog.insigh8s.io/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Insigh8s : An opinionated MCP server for Kubernetes triage]]></title><description><![CDATA[It's 2am. Your phone goes off. A cost alert fires for the payments namespace: 23% over budget for the week. You open your laptop.
You check the Grafana dashboard. Spend is up, but it doesn't say why. ]]></description><link>https://blog.insigh8s.io/insigh8s-an-opinionated-mcp-server-for-kubernetes-triage</link><guid isPermaLink="true">https://blog.insigh8s.io/insigh8s-an-opinionated-mcp-server-for-kubernetes-triage</guid><category><![CDATA[Kubernetes]]></category><category><![CDATA[mcp]]></category><category><![CDATA[finops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Ashutosh Rathore]]></dc:creator><pubDate>Sun, 19 Apr 2026 11:19:44 GMT</pubDate><content:encoded><![CDATA[<hr />
<p>It's 2am. Your phone goes off. A cost alert fires for the <code>payments</code> namespace: 23% over budget for the week. You open your laptop.</p>
<p>You check the Grafana dashboard. Spend is up, but it doesn't say why. You switch to OpenCost: fine-grained allocation, but no correlation with what changed. You <code>kubectl get pods -n payments</code> to see what's running. Forty pods. Which ones matter? You check the last few Argo CD syncs. Something was deployed three days ago. Was that it?</p>
<p>Then the security on-call pings you: there's a PCI audit next quarter, and three of those pods are running as root. They've been like that for weeks but nobody flagged it. Now it's blocking the audit.</p>
<p>So you're sitting there at 2am, with five browser tabs open, correlating five sources of data by hand, trying to figure out: <strong>what actually broke, what does it cost, what's the priority, and what's the fix?</strong></p>
<p>This is the job. Every platform engineer, SRE, FinOps lead, and security reviewer knows this flow. The tools are all there. The data is all there. The gap is everything that lives between the tools.</p>
<p>That gap is what <a href="https://insigh8s.io"><strong>Insigh8s</strong></a> is trying to fill.</p>
<h2>The problem isn't visibility. It's judgement.</h2>
<p>Here's the thing nobody says out loud: your AI assistant can already see all of this data.</p>
<p>Claude Desktop can install the Kubernetes MCP server and run <code>kubectl</code> for you. There's an AKS MCP server. OpenCost ships with a built-in MCP on port 8081. Kyverno's PolicyReports are CRDs that any MCP can query. Even Microsoft Sentinel now has MCP support for security data.</p>
<p>You can wire a dozen MCPs to your AI today. Most teams aren't doing it, but the capability exists.</p>
<p><strong>What you can't wire in is the judgement.</strong> Knowing:</p>
<ul>
<li><p>Which of those data sources to query for a given question</p>
</li>
<li><p>How to join results across them (pod name → cost allocation → policy violation → recent deploy → network flow)</p>
</li>
<li><p>What thresholds actually matter (is 72% CPU headroom "waste"? what about 35%?)</p>
</li>
<li><p>How to rank findings by what will actually hurt if you don't fix it today</p>
</li>
<li><p>What a copy-paste remediation looks like for each category</p>
</li>
</ul>
<p>That judgement still lives in the head of the engineer on-call at 2am. Every 2am. For every incident. Forever.</p>
<p>LLMs can get better at reasoning, but they can't learn <em>your</em> cluster's specific triage playbook just by getting smarter. And even if they could, you probably don't want "the AI decided what to do this time" as your answer for a PCI auditor.</p>
<h2>What Insigh8s is</h2>
<p>Insigh8s is one MCP server. Not a SaaS, not a dashboard, not an AI agent. A single open-source program you run on your cluster that registers itself once with your AI assistant.</p>
<p>It exposes a handful of <strong>composite tools</strong>: functions like <code>investigate_namespace</code>, <code>namespace_cost</code>, and <code>audit_namespace</code>. Each one corresponds to a triage question a human would actually ask.</p>
<p>Here's the mental model I find useful: <strong>think of Insigh8s like a SQL view.</strong></p>
<p>In a database, you could write a complex JOIN across five tables every time you want an answer. Or you could create a view like <code>v_namespace_audit</code> that encodes the join once, tested and named. Anyone querying the view gets the same clean answer. The DBA who wrote the view encoded their understanding of the schema once, and every user benefits.</p>
<p>Insigh8s is the view. Your AI assistant is the SQL client. The underlying systems (kubectl, OpenCost, Prometheus, Kyverno, Hubble) are the tables.</p>
<h2>The three v0.1 tools, one per intent</h2>
<p>Early on, we tried to build one composite tool that answered everything: audit, cost, triage, compliance, all from a single call. It felt clever. It was also wrong.</p>
<p>Different people asking different questions want different answers. A security reviewer auditing for PCI doesn't want pod restart counts mixed into the report. An SRE at 2am doesn't want a compliance score. A FinOps analyst asking about spend doesn't want a network flow summary. Bundling those concerns into one tool creates an output that's noisy for every caller and clean for none.</p>
<p>So v0.1 ships three composite tools, each answering one clear question:</p>
<ul>
<li><p><code>investigate_namespace(namespace, window)</code> → SRE workflow. What's wrong and why?</p>
</li>
<li><p><code>namespace_cost(namespace, window)</code> → FinOps workflow. What does this cost?</p>
</li>
<li><p><code>audit_namespace(namespace, framework)</code> → compliance workflow. Is this compliant with [framework X]?</p>
</li>
</ul>
<p>Plus one helper, <code>list_audit_frameworks()</code>, which the AI calls when the user asks for an audit without specifying which framework they mean.</p>
<h2>What <code>investigate_namespace</code> actually does</h2>
<p>Let's take the SRE tool, since it's the one most people will hit first. Say you call <code>investigate_namespace("payments", window="15m")</code>. Here's what happens inside Insigh8s's code, not in your AI's reasoning:</p>
<ol>
<li><p>Shell out to <code>kubectl get pods -n payments</code> and classify each pod as healthy, degraded (recent restarts), or failed (CrashLoopBackOff, ImagePullBackOff, Pending too long)</p>
</li>
<li><p>For each failed pod, pull the last 20 lines of logs and look for known error patterns (OOMKilled, context deadline, connection refused, DNS failures)</p>
</li>
<li><p>Query <code>kubectl get deployments -n payments</code> with revision history, find any deploy that landed within the window</p>
</li>
<li><p>Correlate: did the failures start <em>after</em> the most recent deploy? If yes, flag it as the likely cause</p>
</li>
<li><p>PromQL query for error rate delta: compare last 15 minutes to the hour before</p>
</li>
<li><p>Optionally, if Hubble is installed, pull flow logs for unusual rejection patterns</p>
</li>
<li><p>Optionally, query <code>PolicyReport</code> CRDs for admission denials in the window</p>
</li>
</ol>
<p>Then, in Go code:</p>
<ol>
<li><p>Join findings by pod and workload</p>
</li>
<li><p>Rank by severity and blast radius</p>
</li>
<li><p>Generate a concise, prioritized summary with suggested next steps (rollback command, pod to describe, log lines to investigate)</p>
</li>
</ol>
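<p>To make the pipeline concrete, here is a minimal Go sketch of steps 1 and 4: the classification and the deploy correlation. The types and field names are illustrative, not the actual Insigh8s code:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// PodStatus is a simplified view of what step 1 parses out of kubectl output.
type PodStatus struct {
	Name     string
	Phase    string // e.g. "Running", "CrashLoopBackOff", "ImagePullBackOff"
	Restarts int
	FailedAt time.Time
}

// classify buckets a pod the way step 1 describes.
func classify(p PodStatus) string {
	switch {
	case p.Phase == "CrashLoopBackOff" || p.Phase == "ImagePullBackOff":
		return "failed"
	case p.Restarts > 0:
		return "degraded"
	default:
		return "healthy"
	}
}

// correlateDeploy is step 4: flag the most recent deploy as the likely cause
// only when every failure started after it landed.
func correlateDeploy(deployAt time.Time, failures []PodStatus) bool {
	if len(failures) == 0 {
		return false
	}
	for _, f := range failures {
		if f.FailedAt.Before(deployAt) {
			return false
		}
	}
	return true
}

func main() {
	deployAt := time.Date(2026, 4, 16, 9, 0, 0, 0, time.UTC)
	pods := []PodStatus{
		{Name: "payments-api-1", Phase: "Running"},
		{Name: "payments-api-2", Phase: "CrashLoopBackOff", Restarts: 7,
			FailedAt: deployAt.Add(10 * time.Minute)},
	}
	var failed []PodStatus
	for _, p := range pods {
		fmt.Println(p.Name, classify(p))
		if classify(p) == "failed" {
			failed = append(failed, p)
		}
	}
	fmt.Println("deploy is likely cause:", correlateDeploy(deployAt, failed))
}
</code></pre>
<p>The full tool layers in the log patterns, PromQL deltas, and optional Hubble and Kyverno signals before ranking, but the shape is the same: plain, testable functions.</p>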
<p>Your AI assistant makes one tool call. The orchestration, the joins, the pattern recognition: all of that lives in reviewable, versionable, testable Go code. Not in the LLM's interpretation of your prompt.</p>
<h2>What <code>namespace_cost</code> does</h2>
<p>Completely separate tool, deliberately narrow. Given a namespace and a window, it hits OpenCost's allocation API, returns spend broken down by workload, computes week-over-week delta, and ranks the top cost drivers.</p>
<p>No policy findings. No rightsizing recommendations (that's a future tool, <code>find_waste</code>). No investigation signals. Just cost, because that's what the caller asked for.</p>
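<p>Under the hood that's mostly arithmetic over OpenCost's allocation response. A hedged Go sketch of the delta-and-rank step, with made-up numbers and simplified types:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"sort"
)

// WorkloadCost pairs a workload with its spend for two consecutive windows,
// the shape namespace_cost might build from OpenCost's allocation data.
type WorkloadCost struct {
	Workload    string
	ThisWeekUSD float64
	LastWeekUSD float64
}

// DeltaPct returns the week-over-week change as a percentage.
func (w WorkloadCost) DeltaPct() float64 {
	if w.LastWeekUSD == 0 {
		return 0
	}
	return (w.ThisWeekUSD - w.LastWeekUSD) / w.LastWeekUSD * 100
}

// topDrivers ranks workloads by current spend, highest first.
func topDrivers(costs []WorkloadCost) []WorkloadCost {
	sort.Slice(costs, func(i, j int) bool {
		return costs[i].ThisWeekUSD > costs[j].ThisWeekUSD
	})
	return costs
}

func main() {
	costs := []WorkloadCost{
		{"payments-api", 310.0, 250.0},
		{"payments-worker", 120.0, 118.0},
		{"fraud-scorer", 540.0, 410.0},
	}
	for _, c := range topDrivers(costs) {
		fmt.Printf("%-16s $%.0f (%+.1f%% WoW)\n", c.Workload, c.ThisWeekUSD, c.DeltaPct())
	}
}
</code></pre>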
<h2>What <code>audit_namespace</code> does</h2>
<p>Also separate, also deliberately narrow. Takes a namespace and a <strong>required framework parameter</strong>. v0.1 supports two frameworks:</p>
<ul>
<li><p><code>pod-security-standards-restricted</code>: the upstream Kubernetes Pod Security Standards, Restricted profile</p>
</li>
<li><p><code>cis-kubernetes-benchmark</code>: the widely recognized CIS Kubernetes hardening spec</p>
</li>
</ul>
<p>The tool checks each relevant control for the framework, lists which pods or containers violate it, and suggests remediation (a Kyverno policy, a <code>kubectl patch</code> command, or a YAML edit).</p>
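<p>Here's roughly what a single control check looks like in Go. This is an illustrative sketch of two Restricted-profile controls, not the shipped implementation:</p>
<pre><code class="language-go">package main

import "fmt"

// ContainerSpec is a pared-down view of one container's securityContext.
type ContainerSpec struct {
	Name                     string
	RunAsNonRoot             bool
	AllowPrivilegeEscalation bool
}

// Violation pairs a failed control with a suggested remediation.
type Violation struct {
	Container, Control, Remediation string
}

// checkRestricted applies two of the Restricted-profile controls: containers
// must run as non-root and must not allow privilege escalation.
func checkRestricted(c ContainerSpec) []Violation {
	var v []Violation
	if !c.RunAsNonRoot {
		v = append(v, Violation{c.Name, "runAsNonRoot",
			"set runAsNonRoot: true in the container securityContext"})
	}
	if c.AllowPrivilegeEscalation {
		v = append(v, Violation{c.Name, "allowPrivilegeEscalation",
			"set allowPrivilegeEscalation: false in the container securityContext"})
	}
	return v
}

func main() {
	// A container that never declared runAsNonRoot fails the first control.
	for _, v := range checkRestricted(ContainerSpec{Name: "legacy-batch"}) {
		fmt.Printf("FAIL %s: %s -> %s\n", v.Container, v.Control, v.Remediation)
	}
}
</code></pre>
<p>Every control is a small pure function like this, which is exactly what makes the audit reviewable by a security team.</p>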
<p>What if the user says <em>"audit the payments namespace"</em> without naming a framework? Your AI assistant calls <code>list_audit_frameworks()</code> first, reads the options, and asks the user which one they want. That's the right division of labor: Insigh8s provides the capabilities, your AI handles the conversation.</p>
<p>More frameworks (SOC2 CC6, ISO 27001 A.12, PCI-DSS 4, NIST 800-190) are planned for v0.2+. Those require more interpretation and sometimes external context, so they're second-wave work.</p>
<h2>Why this matters even as AI gets better</h2>
<p>The obvious objection: <em>"Won't Claude 7 or GPT-6 just do all of this natively?"</em></p>
<p>Maybe. But that's missing what this kind of tool actually is.</p>
<p>Cursor still exists even though Claude writes code. k9s still exists even though kubectl works fine. Terraform still exists even though every cloud has a web console. The "raw capability exists in the platform" vs "opinionated product for teams" distinction is durable.</p>
<p>When an enterprise ops team handles a real incident, they don't want "what did the AI decide to do this time." They want:</p>
<ul>
<li><p><strong>Deterministic answers.</strong> Same question today and tomorrow.</p>
</li>
<li><p><strong>Versioned tools.</strong> <code>audit_namespace</code> v1.2 is a diff you can review.</p>
</li>
<li><p><strong>Compliance-readable code.</strong> Your security team can read what gets checked.</p>
</li>
<li><p><strong>Works offline.</strong> Because Claude's API goes down sometimes, and you still have an incident.</p>
</li>
<li><p><strong>Consistent across models.</strong> Claude, GPT, Gemini, local Llama all give the same answer because they're all calling the same code.</p>
</li>
</ul>
<p>As LLMs take on more critical work, the pressure for this kind of reviewability increases, not decreases.</p>
<h2>Who this is for</h2>
<p>Four audiences, each mapped to the tool that answers their question:</p>
<h3>Developers → <code>investigate_namespace</code></h3>
<p><em>"My deploy failed. What broke?"</em> → one tool call returns the failing pod, the deploy that preceded the failure, the error pattern in the logs, and the likely fix, instead of digging through kubectl, logs, and recent git commits in three separate tools.</p>
<h3>Platform / SRE → <code>investigate_namespace</code></h3>
<p><em>"Something's wrong in the payments namespace."</em> → one tool call returns unhealthy pods, recent deploys that correlate with the problems, log error patterns, error rate changes, and unusual flows. Your 2am triage playbook, encoded. Same tool the developer uses, because "what's broken?" is fundamentally the same question regardless of role.</p>
<h3>FinOps → <code>namespace_cost</code></h3>
<p><em>"What is the payments namespace costing us?"</em> → one tool call returns spend broken down by workload, week-over-week delta, and top cost drivers. Clean output focused on the money question. For rightsizing and waste-hunting specifically, future tools will land in v0.2 (<code>find_waste</code>, <code>idle_workloads</code>).</p>
<h3>Security and auditors → <code>audit_namespace</code></h3>
<p><em>"Audit the payments namespace against CIS Kubernetes Benchmark."</em> → one tool call returns a compliance report: which controls pass, which fail, which pods are the violators, and what the remediation looks like. v0.1 supports Pod Security Standards (Restricted) and CIS Kubernetes Benchmark. If you don't name a framework, your AI asks which one you want.</p>
<p>The same principle underlies all of them: <strong>one tool per intent, no god-tool.</strong> An SRE investigating a problem should get investigation output. A FinOps engineer asking about cost should get cost output. A compliance reviewer running an audit should get audit output. No tool tries to answer every question at once.</p>
<h2>What's next</h2>
<p>Insigh8s is a work in progress. <strong>v0.1 is coming soon as open-source</strong>, Apache 2.0, with three composite tools (<code>investigate_namespace</code>, <code>namespace_cost</code>, <code>audit_namespace</code>) plus <code>list_audit_frameworks</code> as a helper. Supports AKS, EKS, GKE, or any CNCF-conformant cluster. No telemetry, no phone-home, no SaaS tier.</p>
<p>v0.2 and beyond will expand each intent: <code>find_waste</code> and <code>idle_workloads</code> on the FinOps side, more audit frameworks (SOC2, ISO 27001, PCI-DSS, NIST), and <code>investigate_pod</code> and <code>trace_latency</code> for deeper SRE work. The roadmap is public on GitHub and genuinely open to input.</p>
<p>If the composite-tool approach resonates with you, there are three ways to get involved early:</p>
<ol>
<li><p><strong>Star the repo</strong> at <a href="https://github.com/insigh8s">github.com/insigh8s</a> to follow progress and see the roadmap take shape.</p>
</li>
<li><p><strong>Shape the design</strong> by joining the GitHub Discussions. The early decisions (which tools to ship next, what the default thresholds should be, how to handle graceful degradation when Kyverno or OpenCost aren't installed) are still open.</p>
</li>
<li><p><strong>Write code.</strong> Good-first-issue labels, a simple Go architecture, and a roadmap you can pick from.</p>
</li>
</ol>
<p>If you just want to be notified when v0.1 drops, there's a minimal email form at <a href="https://insigh8s.io">insigh8s.io</a>. One email when it ships. Nothing else.</p>
<h2>One last thing</h2>
<p>If you've ever been the human correlation engine at 2am, you already know why this has to exist. The tools have always been there. The data has always been there. What's been missing is something that encodes the judgement (the knowing-which-thing-matters part) into code that your AI can call once and get right.</p>
<p>That's Insigh8s. An opinionated MCP for Kubernetes triage, built by people who've done the 2am correlation by hand enough times to want something better.</p>
<p>Come build it with us.</p>
<hr />
<p><em>Insigh8s is a community project, open-source under Apache 2.0. Website:</em> <a href="https://insigh8s.io"><em>insigh8s.io</em></a><em>. GitHub:</em> <a href="https://github.com/insigh8s"><em>github.com/insigh8s</em></a><em>. Blog:</em> <a href="https://blog.insigh8s.io"><em>blog.insigh8s.io</em></a><em>.</em></p>
]]></content:encoded></item><item><title><![CDATA[How Uber Runs 30 Million ML Predictions Per Second on Kubernetes]]></title><description><![CDATA[The architecture behind Michelangelo and what every engineering team can learn from it
Published by Insigh8s · Tech Stories · KubeCon EU 2026 Amsterdam
At KubeCon EU 2026 in Amsterdam, When I heard 30]]></description><link>https://blog.insigh8s.io/tech-stories-how-uber-runs-30-million-ml-predictions-per-second-on-kubernetes</link><guid isPermaLink="true">https://blog.insigh8s.io/tech-stories-how-uber-runs-30-million-ml-predictions-per-second-on-kubernetes</guid><dc:creator><![CDATA[Ashutosh Rathore]]></dc:creator><pubDate>Thu, 02 Apr 2026 10:45:53 GMT</pubDate><content:encoded><![CDATA[<p><strong>The architecture behind Michelangelo and what every engineering team can learn from it</strong></p>
<p><em>Published by Insigh8s · Tech Stories · KubeCon EU 2026 Amsterdam</em></p>
<p>At KubeCon EU 2026 in Amsterdam, I heard a number that stuck with me: <strong>Uber serves 30 million ML predictions per second at peak, across roughly 1,000 serving nodes, under 10ms P95 latency</strong>. My first thought was: <em>how does Kubernetes actually hold that together?</em></p>
<p>Because here's the thing. Kubernetes was not designed for this. It was built to orchestrate stateless web services. ML inference at Uber's scale is stateful, GPU-bound, latency-sensitive, and spans multiple clusters across multiple clouds and data centres. Getting Kubernetes to work at this scale required Uber to rethink almost every default assumption the platform makes.</p>
<p>This is the story of how they did it. Not the marketing version. The architecture version.</p>
<h2>A Brief History: From Chaos to Michelangelo</h2>
<p>Before 2016, Uber's ML story was familiar chaos. Data scientists built models on laptops using R, scikit-learn, or whatever suited them. When a model needed to go to production, a separate engineering team would build a bespoke serving container for it, from scratch, every time. No shared pipelines, no feature reuse, no reproducibility.</p>
<p>The result? ML impact was bottlenecked by whatever a handful of engineers could cobble together. Uber was scaling to tens of millions of trips per day and their ML infrastructure looked like a patchwork quilt.</p>
<p>In 2016, they launched <strong>Michelangelo</strong>, an internal ML-as-a-Service platform. The goal was simple to state and hard to execute: <em>democratize ML across Uber so that any team could build, deploy, and operate models at full production scale.</em></p>
<p>A decade later, that platform runs 100% of Uber's mission-critical ML workloads. Here's what its architecture looks like today, and more importantly, <em>why</em> the decisions were made the way they were.</p>
<h2>The Architecture: Three Planes, One Platform</h2>
<p>Michelangelo 2.0 is organized around three distinct planes. This separation is not cosmetic. Each plane has fundamentally different scaling characteristics, failure modes, and infrastructure requirements. Collapsing them together would make each one worse.</p>
<pre><code class="language-plaintext">┌─────────────────────────────────────────────────────┐
│                   CONTROL PLANE                     │
│         Kubernetes Operator pattern (CRDs)          │
│   Manages lifecycle: models, jobs, deployments      │
└──────────────┬──────────────────────┬───────────────┘
               │                      │
               ▼                      ▼
┌──────────────────────┐  ┌───────────────────────────┐
│   OFFLINE DATA PLANE │  │    ONLINE DATA PLANE      │
│                      │  │                           │
│  Ray + Spark on K8s  │  │  NVIDIA Triton serving    │
│  20K training jobs/mo│  │  30M predictions/second   │
│  Batch inference     │  │  Real-time feature serving│
│  Model evaluation    │  │  Near-real-time features  │
└──────────────────────┘  └───────────────────────────┘
               │                      │
               └──────────┬───────────┘
                          ▼
          ┌───────────────────────────────┐
          │   MICHELANGELO JOB CONTROLLER │
          │   (Federation layer)          │
          │   Schedules across N clusters │
          └───────────────────────────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
      K8s Cluster A   K8s Cluster B   K8s Cluster C
      (AZ-1, on-prem) (AZ-2, OCI)    (AZ-3, GCP)
</code></pre>
<p>Let's go through each layer in detail.</p>
<h2>Layer 1: The Control Plane, Kubernetes Operators All the Way Down</h2>
<p>The control plane is the brain of Michelangelo. It manages the lifecycle of every ML entity in the system: training jobs, model versions, deployments, serving configurations, and monitoring policies.</p>
<p>The key design decision here was to model the entire ML lifecycle as <strong>Kubernetes Custom Resources</strong> using the Operator pattern. Every ML concept, whether a training job, a model deployment, or a feature pipeline, becomes a CRD that the platform manages declaratively.</p>
<p>This is not a trivial decision. Here's what it actually means in practice.</p>
<h3>What the Operator Pattern Gives You</h3>
<p>In standard Kubernetes, you describe the desired state of your application in a YAML manifest. Kubernetes continuously reconciles actual state to match desired state. The Operator pattern extends this to arbitrary domain-specific objects.</p>
<p>For Michelangelo, that means an ML engineer can write something conceptually like this:</p>
<pre><code class="language-yaml">apiVersion: michelangelo.uber.com/v1
kind: ModelDeployment
metadata:
  name: deepeta-v3
  namespace: maps-team
spec:
  model:
    name: deepeta
    version: "3.2.1"
    framework: pytorch
  serving:
    replicas: 50
    resources:
      gpu: "1"
      memory: "32Gi"
    autoscaling:
      minReplicas: 20
      maxReplicas: 200
      targetLatencyP95Ms: 10
  featureStore:
    paletteSets:
      - trip-context-features
      - realtime-traffic-features
  trafficPolicy:
    canaryWeight: 5
    stableWeight: 95
</code></pre>
<p>And the Michelangelo Operator handles everything that happens next: pulling the model artifact, provisioning GPU nodes, routing traffic, wiring up the feature store, configuring autoscaling, and setting up monitoring. The engineer didn't write a single deployment script.</p>
<h3>Why This Matters for Architecture Teams</h3>
<p>The Operator pattern gives you <strong>operational knowledge encoded in software</strong> rather than in runbooks. The <code>kubectl</code> mental model your engineers already know maps directly onto ML lifecycle management. You get audit trails, GitOps compatibility, RBAC via standard Kubernetes primitives, and the ability to extend the platform without rewriting the core.</p>
<p>The alternative is bespoke orchestration scripts, Jenkins jobs, and custom APIs. That is what Uber had before. The Operator pattern is why they could go from dozens of ML use cases to 5,300 production models without proportionally scaling the platform team.</p>
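<p>If you've never written one, a reconciler is less magic than it sounds. Here's a framework-free Go sketch of the core loop: compare desired state (from the CRD) against observed state and emit the action that closes the gap. Real operators typically build this on controller-runtime, and the types here are invented for illustration:</p>
<pre><code class="language-go">package main

import "fmt"

// ModelDeployment holds the desired state, reduced to the one field the
// reconciler acts on in this sketch.
type ModelDeployment struct {
	Name            string
	DesiredReplicas int
}

// ClusterState holds the observed replica count per deployment.
type ClusterState struct {
	Replicas map[string]int
}

// reconcile is the heart of the Operator pattern: diff desired against
// observed and return the action that converges them. An operator runs this
// continuously, so drift from any source gets corrected.
func reconcile(d ModelDeployment, s ClusterState) string {
	observed := s.Replicas[d.Name]
	switch {
	case observed == d.DesiredReplicas:
		return "in sync"
	case observed > d.DesiredReplicas:
		return fmt.Sprintf("scale %s down %d -> %d", d.Name, observed, d.DesiredReplicas)
	default:
		return fmt.Sprintf("scale %s up %d -> %d", d.Name, observed, d.DesiredReplicas)
	}
}

func main() {
	state := ClusterState{Replicas: map[string]int{"deepeta-v3": 42}}
	fmt.Println(reconcile(ModelDeployment{Name: "deepeta-v3", DesiredReplicas: 50}, state))
}
</code></pre>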
<h3>Comparison with Standard Kubernetes Patterns</h3>
<table>
<thead>
<tr>
<th>Concern</th>
<th>Standard Kubernetes</th>
<th>Michelangelo Approach</th>
</tr>
</thead>
<tbody><tr>
<td>Workload definition</td>
<td>Deployment / StatefulSet</td>
<td>Custom ModelDeployment CRD</td>
</tr>
<tr>
<td>Lifecycle management</td>
<td>Manual kubectl / Helm</td>
<td>Operator reconciliation loop</td>
</tr>
<tr>
<td>Scaling</td>
<td>HPA on CPU/memory</td>
<td>Custom metrics (latency, QPS)</td>
</tr>
<tr>
<td>Feature injection</td>
<td>ConfigMap / Secret</td>
<td>Palette feature store sidecar</td>
</tr>
<tr>
<td>Multi-cluster</td>
<td>Manual federation</td>
<td>Job Controller (see below)</td>
</tr>
</tbody></table>
<h2>Layer 2: The Offline Data Plane, Training at Scale with Ray on Kubernetes</h2>
<p>The offline data plane handles everything that doesn't need to be real-time: model training, evaluation, batch inference, and data pipeline execution. This is where Uber's 20,000 monthly training jobs run.</p>
<p>Until 2023, this plane ran on <strong>Apache Mesos + Peloton</strong>, Uber's homegrown cluster manager. When they decided to move to Kubernetes, they made it not just a like-for-like migration but a full rethink of how training infrastructure should work.</p>
<h3>Why They Moved from Mesos to Kubernetes</h3>
<p>The old system had three critical problems.</p>
<p><strong>Leaky abstraction.</strong> ML engineers had to be aware of infrastructure details: which region, which zone, which cluster had available GPU SKUs. That is not their job, and the cognitive overhead compounded at scale.</p>
<p><strong>Tight coupling.</strong> The serving infrastructure was tightly coupled to the underlying compute, making migrations painful and cloud portability nearly impossible.</p>
<p><strong>Ecosystem mismatch.</strong> Both Ray and Spark, the two frameworks Uber relies on heavily, had developed native Kubernetes support. Staying on Mesos meant maintaining custom integrations indefinitely.</p>
<p>Kubernetes solved all three. But it introduced a new problem.</p>
<h3>The Multi-Cluster Training Problem</h3>
<p>Uber runs 5,000+ GPUs. A single Kubernetes cluster cannot hold all of them, both because of node count limits and because GPUs are distributed across availability zones and cloud providers for resilience.</p>
<p>This means a training job might need to run on Cluster A (on-prem, AZ-1) or Cluster B (OCI, AZ-2) or Cluster C (GCP, AZ-3) depending on which cluster has the right GPU SKU available right now.</p>
<p>The naive solution is to expose all of this to ML engineers. That's the old world and it doesn't scale.</p>
<p>Uber's solution is the <strong>Michelangelo Job Controller</strong>, a federation layer that sits above all the clusters. An engineer submits a <code>JobSpec</code> that describes what they need:</p>
<pre><code class="language-yaml">apiVersion: michelangelo.uber.com/v1
kind: TrainingJob
metadata:
  name: deepeta-training-run-472
spec:
  framework: ray
  resources:
    instanceType: a100-80gb
    gpuCount: 64
    memoryGb: 512
  code:
    image: michelangelo/deepeta-trainer:3.2.0
  data:
    paletteFeatureSets:
      - trip-features-v4
    hiveTable: trips.training_data_2024
  output:
    modelRegistry: gallery
    experimentId: deepeta-q1-2025
</code></pre>
<p>The Job Controller determines which cluster to run on, handles scheduling across the federation, and abstracts all infrastructure details away from the engineer. <strong>Engineers think about workloads, not clusters.</strong></p>
<p>The training stack itself uses Ray + PyTorch + DeepSpeed + Hugging Face, all running natively on Kubernetes via Ray's Kubernetes operator (<code>KubeRay</code>). This gives them dynamic worker scaling, fault tolerance, and the ability to leverage spot/preemptible instances for cost optimization on non-critical training runs.</p>
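<p>The scheduling decision at the heart of the Job Controller can be sketched in a few lines of Go. The details of Uber's implementation aren't public, so treat this as the idea, not the code:</p>
<pre><code class="language-go">package main

import (
	"errors"
	"fmt"
)

// Cluster describes one member of the federation: which GPU SKUs it has free.
type Cluster struct {
	Name     string
	FreeGPUs map[string]int // SKU -> available count
}

// JobSpec mirrors the resources section of the TrainingJob manifest above.
type JobSpec struct {
	InstanceType string
	GPUCount     int
}

// schedule picks the first cluster that can satisfy the job's GPU request,
// which is the core decision the federation layer makes so engineers never
// have to think about clusters.
func schedule(job JobSpec, clusters []Cluster) (string, error) {
	for _, c := range clusters {
		if c.FreeGPUs[job.InstanceType] >= job.GPUCount {
			return c.Name, nil
		}
	}
	return "", errors.New("no cluster has enough " + job.InstanceType + " GPUs")
}

func main() {
	clusters := []Cluster{
		{Name: "onprem-az1", FreeGPUs: map[string]int{"a100-80gb": 16}},
		{Name: "oci-az2", FreeGPUs: map[string]int{"a100-80gb": 96}},
	}
	target, _ := schedule(JobSpec{InstanceType: "a100-80gb", GPUCount: 64}, clusters)
	fmt.Println("scheduled on:", target) // only oci-az2 has 64 free A100s
}
</code></pre>
<p>A production controller would also weigh data locality, queue depth, and preemption cost, but the abstraction boundary is the same: the JobSpec goes in, a placement comes out.</p>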
<h2>Layer 3: The Online Data Plane, Serving 30M Predictions Per Second</h2>
<p>This is the layer that produces the headline number. And it's where the most interesting engineering decisions were made.</p>
<h3>Replacing the Serving Engine with NVIDIA Triton</h3>
<p>Michelangelo 2.0 replaced its previous custom serving engine with <strong>NVIDIA Triton Inference Server</strong>. Triton is open source, supports TensorFlow, PyTorch, XGBoost, and TensorRT from a single runtime, and integrates natively with Kubernetes.</p>
<p>The key advantage for Uber was a single serving runtime across all model types. Previously, different model frameworks required different serving containers. That meant maintaining multiple code paths, multiple Docker images, and multiple operational runbooks. Triton collapses this into a single abstraction.</p>
<p>The serving architecture looks like this:</p>
<pre><code class="language-plaintext">Incoming prediction request (e.g. from rides-matching service)
        │
        ▼
   Load Balancer / Service Mesh (Envoy)
        │
        ▼
   Triton Serving Pod (GPU node)
   ┌─────────────────────────────────┐
   │  Model: deepeta v3.2.1          │
   │  Framework: PyTorch             │
   │                                 │
   │  ┌───────────────────────────┐  │
   │  │  Feature enrichment       │  │
   │  │  (Palette sidecar)        │  │
   │  └───────────────────────────┘  │
   │            │                    │
   │            ▼                    │
   │  ┌───────────────────────────┐  │
   │  │  Model inference (GPU)    │  │
   │  │  P95 latency target: 10ms │  │
   │  └───────────────────────────┘  │
   └─────────────────────────────────┘
        │
        ▼
   Prediction response
</code></pre>
<h3>Elastic CPU/GPU Sharing: The Efficiency Unlock</h3>
<p>One of the most operationally clever decisions Uber made was elastic resource sharing between training and serving.</p>
<p>Serving pods are optimized for latency. They need consistent, reserved GPU capacity. But at 3am, when trip demand is low, those GPUs sit mostly idle. Training jobs are throughput-oriented and can tolerate some preemption.</p>
<p>Michelangelo implements a reactive scheduling policy: idle serving capacity gets opportunistically allocated to training workloads during off-peak hours, then reclaimed when serving demand rises. This is not Kubernetes bin-packing out of the box. It requires custom scheduling logic in the Job Controller that is aware of serving SLAs.</p>
<p>The result is significantly higher overall GPU utilization. GPUs are expensive. Letting them sit idle at 3am is throwing money away.</p>
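<p>The core policy is simple to state in code. A toy Go sketch of the lending decision, with an invented headroom buffer standing in for the real SLA-awareness:</p>
<pre><code class="language-go">package main

import "fmt"

// lendableGPUs decides how many reserved serving GPUs can be lent to training
// right now, keeping a headroom buffer so a demand spike doesn't violate
// serving SLAs before the GPUs can be reclaimed.
func lendableGPUs(reserved, inUse int, headroomPct float64) int {
	headroom := int(float64(reserved) * headroomPct)
	free := reserved - inUse - headroom
	if free > 0 {
		return free
	}
	return 0
}

func main() {
	// 3am: 1,000 GPUs reserved for serving, only 250 busy, keep 20% headroom.
	fmt.Println(lendableGPUs(1000, 250, 0.20)) // 550 GPUs lent to training
	// Peak: 850 busy, nothing to lend; training gets preempted instead.
	fmt.Println(lendableGPUs(1000, 850, 0.20)) // 0
}
</code></pre>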
<h3>Autoscaling: Not Just HPA</h3>
<p>Standard Kubernetes HPA scales on CPU and memory. For ML serving, those metrics are nearly meaningless. A GPU can be at 30% utilization while serving at full capacity, and scaling on CPU would give you the wrong answer entirely.</p>
<p>Uber scales their serving pods on inference-specific metrics: requests per second, P95 latency, and GPU queue depth. These are exposed via Prometheus and consumed by a custom autoscaler that understands the difference between "GPU is underutilized" and "we're approaching saturation."</p>
<p>A simplified version of this HPA configuration looks like this:</p>
<pre><code class="language-yaml">apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: deepeta-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepeta-triton
  minReplicas: 20
  maxReplicas: 200
  metrics:
  - type: Pods
    pods:
      metric:
        name: triton_inference_request_duration_p95_ms
      target:
        type: AverageValue
        averageValue: "8"   # target 8ms, alert at 10ms
  - type: Pods
    pods:
      metric:
        name: triton_inference_queue_duration_us
      target:
        type: AverageValue
        averageValue: "1000"  # queue time under 1ms
</code></pre>
<h2>The Feature Store: Solving Training-Serving Skew</h2>
<p>No discussion of Michelangelo's architecture is complete without <strong>Palette</strong>, Uber's centralized feature store. This is the piece that makes the whole system actually produce accurate predictions.</p>
<p>Training-serving skew is one of the most common silent killers of ML model accuracy. Your model trains on features computed one way (in batch, offline), but at serving time the same features are computed differently (in real-time, online). The predictions degrade subtly and the root cause is nearly impossible to debug.</p>
<p>Palette solves this with a dual-store architecture:</p>
<pre><code class="language-plaintext">Feature Definition (written once, in DSL)
        │
        ├──────────────────────────────────┐
        ▼                                  ▼
  OFFLINE STORE                      ONLINE STORE
  (HDFS / Hive)                      (Cassandra)
  Batch computation                  Real-time serving
  for training data                  &lt;5ms reads
        │                                  │
        ▼                                  ▼
  Model training                    Model serving
  (same features)                   (same features)
</code></pre>
<p>The same feature definition code runs in both contexts. Training and serving are guaranteed to compute features identically. Uber has over 20,000 features in Palette, shared across teams. When the rides team computes a "driver average rating in last 30 days" feature, the Eats team can reuse it directly rather than reimplementing it.</p>
<p>On Kubernetes, Palette runs alongside each serving pod as a sidecar that handles real-time feature lookups from Cassandra. The serving container doesn't need to know where features come from; it just calls the sidecar.</p>
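<p>To make that concrete, a serving pod with a feature-lookup sidecar could look like the sketch below. This is illustrative only: the image names, port, and environment variable are hypothetical, not Uber's actual setup.</p>
<pre><code class="language-yaml"># Illustrative pod spec: a Triton serving container plus a feature-lookup sidecar.
# Image names, the port, and FEATURE_ENDPOINT are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: model-serving
spec:
  containers:
  - name: triton-server
    image: nvcr.io/nvidia/tritonserver:24.01-py3
    env:
    - name: FEATURE_ENDPOINT
      value: "http://localhost:8500"  # model code calls the sidecar, never Cassandra
  - name: feature-sidecar
    image: registry.internal/palette-sidecar:latest  # hypothetical image
    ports:
    - containerPort: 8500
    # The sidecar owns the Cassandra connection pool and any caching;
    # the serving container stays storage-agnostic.
</code></pre>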
<h2>What This Means for Your Organisation</h2>
<p>You are almost certainly not operating at Uber's scale. But the architectural patterns here apply at 10 clusters as much as at 1,000. Here's how to think about each layer for your own platform.</p>
<h3>1. Adopt the Operator pattern early, not retroactively</h3>
<p>If you are building an internal ML platform or even just standardizing how teams deploy models, the Operator pattern is the right foundation. The effort to implement it early is far less than the effort to migrate to it later. Start with a <code>ModelDeployment</code> CRD and a simple reconciler. Even a basic one eliminates entire categories of toil.</p>
<p>Tools to evaluate: <strong>Kubeflow</strong>, <strong>KServe</strong>, and <strong>BentoML</strong> all implement Operator-based ML lifecycle management and can give you this foundation without building from scratch.</p>
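<p>For a feel of the contract, here is roughly what this looks like with KServe's <code>InferenceService</code> resource — a sketch based on KServe's documented v1beta1 API, with a placeholder model URI:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: eta-predictor
spec:
  predictor:
    model:
      modelFormat:
        name: xgboost                             # KServe picks a matching runtime image
      storageUri: s3://models/eta-predictor/v42   # placeholder
    minReplicas: 2
    maxReplicas: 10
</code></pre>
<p>The model owner declares what to serve; the controller reconciles the Deployments, Services, and autoscaling underneath it, which is exactly the toil a CRD is meant to absorb.</p>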
<h3>2. Separate your control plane from your data planes</h3>
<p>This is a general distributed systems principle that Michelangelo applies well. Things with different scaling characteristics and failure modes should be separated. Your lifecycle management system (control plane) should not go down when your serving infrastructure is under load (data plane). Keep them independent.</p>
<h3>3. Address multi-cluster before you need it</h3>
<p>Most teams add multi-cluster support reactively, when they're already in pain. The federation layer concept, a scheduler that sits above your clusters and abstracts away their heterogeneity, is worth designing for even if you start with a single cluster. It will save you from having to re-architect your scheduling logic later.</p>
<p>Tools worth evaluating: <strong>Kueue</strong> (CNCF project, now adopted by Google for batch ML workloads), <strong>Volcano</strong>, and <strong>YuniKorn</strong> as starting points for multi-cluster-aware scheduling.</p>
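<p>A minimal Kueue setup gives a flavor of the model: one cluster-wide queue holding quota, one namespaced queue that teams submit to. This sketch follows Kueue's v1beta1 API; the names and quota numbers are arbitrary.</p>
<pre><code class="language-yaml">apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training
spec:
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: default-flavor
      resources:
      - name: cpu
        nominalQuota: "64"       # arbitrary quotas for illustration
      - name: memory
        nominalQuota: 256Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 16
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-rides
  namespace: rides
spec:
  clusterQueue: ml-training
</code></pre>
<p>Jobs then opt in with the <code>kueue.x-k8s.io/queue-name: team-rides</code> label and are held until quota is available — the same admission-control idea the federation layer implements across clusters.</p>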
<h3>4. Use Triton if you have multiple model frameworks</h3>
<p>If your organisation runs more than one model type (XGBoost for tabular data plus PyTorch for deep learning is extremely common), a unified serving runtime eliminates massive operational overhead. Triton is open source, Kubernetes-native, and battle-tested at Uber scale.</p>
<h3>5. Solve training-serving skew before it bites you</h3>
<p>If you don't have a feature store, you have training-serving skew. You just don't know it yet. The dual-store pattern (batch offline store for training, low-latency online store for serving, same feature definitions for both) is the solution. Open-source options include <strong>Feast</strong> (CNCF) and <strong>Hopsworks</strong>; <strong>Tecton</strong> is a commercial option.</p>
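<p>Feast makes the dual-store split explicit in its configuration: a single <code>feature_store.yaml</code> declares both stores, and the same feature definitions are materialized into each. A sketch using Feast's documented config format, with placeholder paths and connection details:</p>
<pre><code class="language-yaml"># feature_store.yaml — one set of feature definitions, two stores
project: rides
registry: s3://feast/registry.db   # placeholder path
provider: local
offline_store:
  type: file                       # batch source: training datasets
online_store:
  type: redis                      # low-latency reads at serving time
  connection_string: "redis:6379"  # placeholder
</code></pre>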
<h3>6. Scale on inference metrics, not CPU</h3>
<p>If you're running GPU workloads on Kubernetes and scaling on CPU, you are scaling on the wrong signal. Expose Triton's Prometheus metrics and build your HPA around request latency and queue depth instead.</p>
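<p>In practice that usually means routing a Triton metric through the Prometheus Adapter so the HPA can see it. The rule below uses Triton's real <code>nv_inference_request_duration_us</code> and <code>nv_inference_count</code> counters to derive an average per-pod latency; the adapter config format is real, but treat the exact query as a sketch to adapt.</p>
<pre><code class="language-yaml">rules:
- seriesQuery: 'nv_inference_request_duration_us{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    as: "triton_inference_request_duration_ms"
  # Average latency over 2m: cumulative duration / request count, µs to ms.
  metricsQuery: >-
    sum(rate(nv_inference_request_duration_us{&lt;&lt;.LabelMatchers&gt;&gt;}[2m])) by (&lt;&lt;.GroupBy&gt;&gt;)
    / sum(rate(nv_inference_count{&lt;&lt;.LabelMatchers&gt;&gt;}[2m])) by (&lt;&lt;.GroupBy&gt;&gt;) / 1000
</code></pre>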
<h2>The Honest Takeaways</h2>
<p>A few things worth noting that don't make it into keynote slides.</p>
<p><strong>They built a lot of this themselves.</strong> The Michelangelo Job Controller (federation layer) is not an open-source project you can deploy. It's proprietary engineering built over years. If you want this capability, you're either building it, buying it (CAST.ai, Run:ai), or starting with Kueue and extending it.</p>
<p><strong>The migration from Mesos to Kubernetes was years of work.</strong> Uber's Kubernetes story is not "we switched and everything was better." It was a multi-year migration with custom integrations for Ray, Spark, and their internal tooling. The payoff was real, but so was the cost.</p>
<p><strong>The 30M/s number is a peak stat.</strong> Average throughput is lower. Peak is what your architecture has to be designed for, but don't mistake peak for typical.</p>
<p><strong>Kubernetes was not sufficient alone.</strong> Every layer described in this post represents work Uber did <em>above</em> Kubernetes: the federation layer, the custom autoscaler, the Operator, the Palette sidecar. Kubernetes is the foundation, not the ceiling. Your ML platform will require the same philosophy.</p>
<h2>Further Reading</h2>
<p>All the architecture details in this post are grounded in Uber's own engineering publications. If you want to go deeper:</p>
<ul>
<li><p><a href="https://www.uber.com/blog/michelangelo-machine-learning-platform/">Meet Michelangelo: Uber's Machine Learning Platform</a> — the original 2017 post, still worth reading for the foundational thinking</p>
</li>
<li><p><a href="https://www.uber.com/blog/ubers-journey-to-ray-on-kubernetes-ray-setup/">Uber's Journey to Ray on Kubernetes</a> — the Mesos to Kubernetes migration story in detail</p>
</li>
<li><p><a href="https://www.uber.com/us/en/blog/scaling-ai-ml-infrastructure-at-uber/">Scaling AI/ML Infrastructure at Uber</a> — the federation layer and GPU scaling decisions</p>
</li>
<li><p><a href="https://www.uber.com/us/en/blog/open-source-and-in-house-how-uber-optimizes-llm-training/">Open Source and In-House: How Uber Optimizes LLM Training</a> — the LLM training stack on top of the Kubernetes federation</p>
</li>
<li><p><a href="https://www.uber.com/en-GB/blog/from-predictive-to-generative-ai/">From Predictive to Generative: How Michelangelo Accelerates Uber's AI Journey</a> — the 2024 overview of where the platform stands today</p>
</li>
<li><p>KubeCon EU 2026 keynote recordings on the <a href="https://www.youtube.com/@cncf">CNCF YouTube Channel</a> — the Uber segment with the 30M/s figures came from this keynote</p>
</li>
</ul>
<p><em>Insigh8s (i8s) is an open-source Kubernetes reference architecture project, an opinionated CNCF-only blueprint for building production-grade Kubernetes platforms. Tech Stories is a recurring series covering how leading engineering organisations solve real infrastructure problems at scale.</em></p>
]]></content:encoded></item></channel></rss>