How Uber Optimizes LLM Training · April 2026
Michelangelo OSS
Business Use Case
Meet Michelangelo. Uber's Machine Learning Platform.
Independent enterprise use case based on public engineering materials. Not affiliated with Uber.
Demo video
Walkthrough · ~4 min
Your first Michelangelo OSS pull request: pipeline reruns and auto-flip
§ 06
Issue glossary
22 real open issues from michelangelo-ai/michelangelo, sorted into 5 buckets. Dense engineering language translated into practitioner-readable framing.
Bucket 01 / 05
Pipeline & Trigger Lifecycle
How jobs are scheduled, versioned, retried, and cleaned up
| # | Issue | What it means |
|---|---|---|
| #1075 | Support trigger batch rerun and auto flip due to pipeline revision · good first issue · missing-feature | When you update your pipeline, scheduled triggers don't pick up the new version automatically. You have to manually rewire them one by one. This adds the runtime logic that should have shipped with the feature. |
| #1091 | Support Pipeline Cascade Delete · missing-feature | Deleting a pipeline doesn't clean up the jobs and resources attached to it. You delete the parent; the children stay orphaned. Cascade delete fixes the cleanup gap. |
| #1055 | Support trigger run creation and more detailed trigger run display page · missing-feature | You can set up triggers but you can't see meaningful detail about what ran, when, or why it failed. This adds both the creation surface and the observability page. |
| #793 | Support pipeline run retry with immutable pipeline run state | If a pipeline run fails, there's no retry mechanism; you'd recreate from scratch. This adds retry while keeping the original run record intact so your history stays clean. |
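The retry row above (#793) can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `PipelineRun` record and `retry_run` helper; neither name comes from the Michelangelo codebase. It shows the core idea only: a retry creates a new record that points back at the failed one, and the failed record is never modified.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PipelineRun:
    """Hypothetical run record; frozen so a finished run cannot be mutated."""
    run_id: str
    pipeline_revision: str
    status: str                      # "PENDING", "SUCCEEDED", or "FAILED" in this sketch
    retry_of: Optional[str] = None   # run_id of the run this one retries
    attempt: int = 1


def retry_run(failed: PipelineRun) -> PipelineRun:
    """Return a new PENDING run linked to the failed one.

    The failed record itself is left untouched, so run history stays intact.
    """
    if failed.status != "FAILED":
        raise ValueError(f"run {failed.run_id} is {failed.status}, not FAILED")
    return replace(
        failed,
        run_id=f"{failed.run_id}-retry{failed.attempt}",
        status="PENDING",
        retry_of=failed.run_id,
        attempt=failed.attempt + 1,
    )
```

Because the dataclass is frozen, any attempt to overwrite `status` on the original run raises an error, which is the property the issue title calls "immutable pipeline run state".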
Bucket 02 / 05
CLI & Developer Tooling (mactl)
The developer entry surface — commands, config, scaffolding
| # | Issue | What it means |
|---|---|---|
| #946 | Add root-level Makefile to simplify developer setup and build commands | Getting the project running locally requires knowing which commands to run in which order. A Makefile turns that into one command. The difference between "clone and figure it out" and "clone and contribute." |
| #937 | Auto-generate mactl config templates from CRD spec | Config templates for the CLI are hand-written and can drift from what the schema supports. Auto-generating from the CRD spec keeps templates accurate without manual maintenance. |
| #936 | Auto-generate CLI arguments from CRD OpenAPI schema | CLI flags get out of sync with the API schema over time. Generating them from the schema source keeps the developer surface honest: the same problem as above, one layer up. |
| #934 | MaCTL support for worker_queue annotation on Project CRD | The worker_queue annotation exists in the platform spec but the CLI doesn't expose it. You can set it in YAML but can't interact with it through mactl. This closes the gap. |
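To make the schema-driven CLI idea in #936 concrete, here is a small sketch using Python's standard `argparse`. The schema fragment, its field names, and the `build_parser` helper are all hypothetical; the real Project CRD schema and the mactl implementation are not reproduced here. The point is only that flags can be derived from one source of truth instead of being hand-written.

```python
import argparse

# Hypothetical fragment of a CRD's OpenAPI v3 schema (illustrative fields only).
PROJECT_SPEC_SCHEMA = {
    "type": "object",
    "required": ["owner"],
    "properties": {
        "owner": {"type": "string", "description": "Owning team"},
        "tier": {"type": "integer", "description": "Service tier"},
        "gpu_enabled": {"type": "boolean", "description": "Allow GPU jobs"},
    },
}

# OpenAPI scalar types mapped to Python converters; unknown types fall back to str.
_TYPE_MAP = {"string": str, "integer": int, "number": float}


def build_parser(schema: dict, prog: str = "example-cli") -> argparse.ArgumentParser:
    """Derive one CLI flag per top-level schema property."""
    parser = argparse.ArgumentParser(prog=prog)
    required = set(schema.get("required", []))
    for name, prop in schema.get("properties", {}).items():
        flag = "--" + name.replace("_", "-")
        help_text = prop.get("description", "")
        if prop.get("type") == "boolean":
            # Booleans become presence flags that default to False.
            parser.add_argument(flag, dest=name, action="store_true", help=help_text)
        else:
            parser.add_argument(
                flag,
                dest=name,
                type=_TYPE_MAP.get(prop.get("type"), str),
                required=name in required,
                help=help_text,
            )
    return parser
```

Calling `build_parser(PROJECT_SPEC_SCHEMA).parse_args(["--owner", "ml-infra", "--tier", "2"])` yields a namespace whose fields mirror the schema, so adding a property to the schema adds the flag with no CLI change.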
Bucket 03 / 05
Platform Schema (K8s / CRD / Spec)
How the platform's resource definitions align to Kubernetes standards
| # | Issue | What it means |
|---|---|---|
| #932 | Replace hdfs_delegation_token plain string with K8s Secret reference | Credentials for data access are stored as plain strings in config. This moves them to K8s Secrets, which is how every security-conscious platform handles sensitive values. |
| #931 | Replace ResourceSpec custom types with K8s resource.Quantity | The platform defined its own resource types instead of using the K8s standard. This creates friction for anyone who already knows Kubernetes. This issue aligns the schema to the ecosystem standard. |
| #929 | Add priority information to JobPrioritySpec | There's a spec object for job priority but it doesn't carry priority data yet. You can reference it but it doesn't do anything. This adds the substance behind the structure. |
| #928 | Add inter-job and anti-affinity to JobSchedulingSpec | You can't tell the platform "don't run these two jobs on the same node." Standard K8s scheduling controls, not yet exposed at the job spec level. |
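For readers who have not met the Kubernetes quantity notation that #931 points to, the sketch below parses a simplified subset of it: decimal suffixes such as `m` and `G`, and binary suffixes such as `Mi` and `Gi`. The `parse_quantity` helper is hypothetical and written for illustration; it is not the Go `resource.Quantity` implementation and skips its validation and canonicalization rules.

```python
from decimal import Decimal

# Binary (power-of-two) suffixes, typically used for memory.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60}

# Decimal (power-of-ten) suffixes; "m" is milli, as in "500m" CPU.
_DECIMAL = {
    "n": Decimal("1e-9"), "u": Decimal("1e-6"), "m": Decimal("1e-3"),
    "k": Decimal("1e3"), "M": Decimal("1e6"), "G": Decimal("1e9"),
    "T": Decimal("1e12"), "P": Decimal("1e15"), "E": Decimal("1e18"),
}


def parse_quantity(text: str) -> Decimal:
    """Convert a quantity string such as "500m" or "2Gi" to a base-unit number."""
    text = text.strip()
    # Check two-letter binary suffixes first so "Gi" is not read as "G" plus "i".
    for suffix, factor in _BINARY.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    for suffix, factor in _DECIMAL.items():
        if text.endswith(suffix):
            return Decimal(text[:-1]) * factor
    return Decimal(text)  # plain number, no suffix
```

With this notation, `"500m"` means half a CPU core and `"2Gi"` means 2 × 1024³ bytes, which is the vocabulary Kubernetes users already expect in a resource spec.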
Bucket 04 / 05
Documentation & Public Readiness
What blocks external contributors from day one
| # | Issue | What it means |
|---|---|---|
| #822 | Documentation Website Polish for Open-Source Launch · documentation · public-readiness | The docs site exists but isn't ready for external eyes: navigation, structure, and presentation still reflect an internal-first audience. This is the public launch polish pass. |
| #771 | Document: Single Docker container for developer onboarding · public-readiness | There's no single-container setup for someone who wants to run Michelangelo locally and start contributing. This is the "zero to running" gap that blocks every new external contributor. |
| #770 | Document: YAML Configuration — Key concepts & terms · public-readiness | The YAML config is how practitioners interact with the platform daily. No glossary. No term definitions. Engineers have to reverse-engineer what each field does from the schema. |
| #1057 | docs: ingester-design.md incorrectly claims ingester watches 13 CRDs (0 since PR #1012) | The architecture docs describe behavior that a merged PR already changed. External contributors are building a mental model of a system that no longer exists. |
Bucket 05 / 05
UI & Workflow Visibility
What practitioners can see and act on inside the platform
| # | Issue | What it means |
|---|---|---|
| #1094 | UI not display workflow DAG as defined in python code | You define your pipeline in Python and the visual interface is supposed to reflect it. It doesn't. The DAG view and the code are out of sync, breaking the primary feedback loop for pipeline authors. |
| #1088 | Restructure test files into nested describes · good first issue · refactoring | Test files are flat with no logical grouping by feature or behavior. When a test fails, it's hard to tell what it's testing. Nested describes make the test suite readable and contributor-friendly. |
| #927 | Project phase transitions have no side effects | Moving a project between lifecycle phases (draft → active → archived) should trigger downstream behaviors. Currently the state changes but nothing else responds. |
| #836 | Failure to connect temporal requires TLS | The workflow orchestration layer fails to connect when TLS is required, a standard configuration in any production or security-conscious environment. Blocks deployment in hardened infra. |
| #442 | Add loading state to Execution component and integrate with Detail page rendering · bug · good first issue | The Execution component renders immediately with no loading state. While data is fetching, the UI shows nothing or stale content. Adds the standard loading pattern so the page feels intentional, not broken. |
| #349 | Update detail view header to display truncated titles in a tooltip · enhancement · good first issue | Long pipeline or job names get cut off in the header with no way to see the full title. Adds a tooltip so engineers can always read the full name without navigating away. |
Issue
#1075 walkthrough
Support trigger batch rerun and auto flip due to pipeline revision
good first issue
missing-feature
The problem
Trigger runs with failed pipeline runs require manual re-triggering one by one. Triggers also pin to a specific pipeline revision — when a new revision registers, the trigger does not auto-flip. The auto_flip field is already in the YAML schema and protobuf spec. The runtime logic is missing.
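The two missing pieces can be sketched as plain functions. This is a minimal illustration under stated assumptions: the `Trigger` and `TriggerRun` shapes and both helper names are hypothetical, and only the `auto_flip` field name comes from the issue. It is not how the Michelangelo trigger runtime is structured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trigger:
    """Hypothetical trigger record pinned to one pipeline revision."""
    name: str
    pipeline: str
    revision: int
    auto_flip: bool  # field name taken from the issue; everything else is assumed


@dataclass
class TriggerRun:
    """Hypothetical record of one run started by a trigger."""
    run_id: str
    trigger: str
    status: str  # "SUCCEEDED" or "FAILED" in this sketch


def on_revision_registered(triggers: List[Trigger], pipeline: str, new_revision: int) -> List[str]:
    """Re-pin opted-in triggers to the new revision; return the names that flipped."""
    flipped = []
    for trigger in triggers:
        if trigger.pipeline == pipeline and trigger.auto_flip and trigger.revision < new_revision:
            trigger.revision = new_revision
            flipped.append(trigger.name)
    return flipped


def select_for_batch_rerun(runs: List[TriggerRun], trigger: str) -> List[str]:
    """Collect every failed run of one trigger so they can be rerun in one call."""
    return [run.run_id for run in runs if run.trigger == trigger and run.status == "FAILED"]
```

Triggers with `auto_flip` set to false stay pinned to their original revision, which keeps the behavior opt-in.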
The translation
Here is how I translate this from my last project, in the world of production enterprise batch processing. A scheduled job fails. It holds up the rest of the batch. Operations gets the call. The failed step gets skipped so the rest of the batch keeps moving, and the failure gets inspected the next morning; usually it is a vendor file that did not arrive on time. Auto-flip is the same operator instinct: when a new revision registers upstream, the system picks it up without a human recreating the trigger. The vocabulary changes across ecosystems. The problem does not.
Contributor path
01 Read the issue body; the solution shape is already scoped into two parts.
02 Comment on the issue to scope the work and ask whether to ship batch rerun and auto-flip as one PR or two.
03 Fork the repo, branch, and locate the trigger runtime in the codebase.
04 Activate the dormant auto_flip runtime logic; add the batch rerun CLI surface (ma trigger_run rerun).
05 Write tests, open a PR linking back to #1075 with "Closes #1075".
06 Iterate on maintainer review. Merge ships in the next release.
If you take this issue on, tag me. I want to see this PR closed.
Supporting work · Public repos
Token-estimator
Pre-deployment cost planning
CLI that gives ML engineers token count and cost transparency before a single API call ships to production.
ContextEval
RAG grounding enforcement
Returns answers only when information exists in the defined corpus. Closes the hallucination gap in production RAG pipelines.
Trace-bench
Output stability + regression
Benchmarks model behavior across versions to surface silent drift and regression in ML serving layers.
Technical storytelling · ContextEval
ContextEval · RAG grounding enforcement pipeline
How production RAG pipelines close the hallucination gap
Input
User Query
Natural language question submitted to the pipeline
↓
Step 01 — Retrieval
Corpus Search
FAISS index · Chunked document store · Similarity ranking
↓
Step 02 — Gate
Confidence Threshold Check
Retrieval score meets threshold → pass · Below threshold → refuse
↓
Pass
Grounded answer
Response sourced from corpus only
Fail
"Not in corpus"
Refuses to answer. No hallucination.
The strategic value: ContextEval applies the same production discipline to RAG pipelines that batch systems apply to data contracts — if the record doesn't exist in the source, the system doesn't invent one.
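The retrieve-then-gate flow above can be sketched without any external dependency. This is an illustration, not ContextEval's code: a brute-force bag-of-words cosine similarity stands in for the FAISS index, the 0.3 threshold is an arbitrary assumption, and the "grounded answer" is simply the best-matching chunk rather than a generated response. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

REFUSAL = "Not in corpus"


def _vectorize(text: str) -> Counter:
    """Lowercased bag-of-words counts; a stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str]) -> Tuple[str, float]:
    """Step 01: return the best-matching chunk and its similarity score."""
    query_vec = _vectorize(query)
    scored = [(chunk, _cosine(query_vec, _vectorize(chunk))) for chunk in chunks]
    return max(scored, key=lambda pair: pair[1])


def answer(query: str, chunks: List[str], threshold: float = 0.3) -> str:
    """Step 02: pass the chunk through only if retrieval clears the threshold."""
    if not chunks:
        return REFUSAL
    chunk, score = retrieve(query, chunks)
    return chunk if score >= threshold else REFUSAL
```

The design choice worth noting is that refusal is decided from the retrieval score alone, before any generation step could run, so an out-of-corpus question never reaches a model that might invent an answer.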
Platform strategy · Visual thesis
Layer A — Supply
Open-source ecosystem
Llama 2 / 3 · Mixtral · Ray · PyTorch · DeepSpeed · Hugging Face · vLLM · LoRA / QLoRA
↓
Layer B — Conversion
Michelangelo platform layer
training · orchestration · scoring · resource scheduling · model lifecycle · federation
↓
Layer C — Application
Uber-scale production systems
search · recommendations · support chatbots · code generation · SQL generation
↓
Layer D — Outcomes
Business outcomes
speed · scale · cost control · reliability · production readiness
§ 01
Executive summary
01 Michelangelo functions as Uber's internal AI/ML platform layer, production-tested at scale across classic ML, deep learning, LLMs, and now agents.
02 Open source accelerates model training, experimentation, and infrastructure development. The platform is built on Ray, Kubernetes, PyTorch, DeepSpeed, Hugging Face, and vLLM.
03 In-house systems convert open-source components into production-ready platform capability. The strategic value is the conversion, not the components themselves.
04 The DevRel opportunity is likely about translating this platform story into practitioner trust, adoption, and clear technical education for external ML platform engineers.
§ 02
Corporate objective mapping
A read of how Michelangelo functions might map to Uber's stated technology objectives.
| Objective | MA Function | OSS Angle |
|---|---|---|
| Leading technology / platform capability | Centralized internal ML platform — training, evaluation, scaling | Open-source frameworks accelerate frontier-model adoption |
| Developer velocity | CLI (mactl), Canvas, notebooks, JobSpec abstraction | Hugging Face, Ray Train reduce engineering effort |
| Production readiness | Federation across multi-cluster K8s, JobSpec lifecycle | Kubernetes + Ray + KubeRay as the orchestration backbone |
| Cost and throughput optimization | DeepSpeed ZeRO, flash attention, batch tuning, MFU benchmarks | Industry-standard MFU using OSS optimizers |
| Trust, reliability, evaluation | Offline LLM scorer, vLLM-based batch prediction, throughput benchmarks | Ray ActorPool + vLLM as the evaluation substrate |
| Open-source ecosystem credibility | MA OSS public repo, "good first issue" labels, contributor onboarding | External adoption signals platform durability beyond Uber |
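Since MFU (model FLOPs utilization) appears in the table as the cost and throughput yardstick, a short worked calculation may help. The sketch uses the common approximation of 6 FLOPs per parameter per training token, which ignores attention FLOPs. The function name and every input number in the usage note are illustrative assumptions, not figures from Uber's materials.

```python
def model_flops_utilization(
    params: float,
    tokens_per_second: float,
    num_gpus: int,
    peak_flops_per_gpu: float,
) -> float:
    """Estimate MFU as achieved training FLOPs divided by theoretical peak FLOPs.

    Uses the ~6 * params FLOPs-per-token approximation for a forward plus
    backward pass; attention FLOPs are ignored in this sketch.
    """
    if num_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops = 6.0 * params * tokens_per_second
    return achieved_flops / (num_gpus * peak_flops_per_gpu)
```

As an illustration, a 7B-parameter model training at 20,000 tokens per second on 8 GPUs with a 312 TFLOPS peak each works out to an MFU of roughly 0.34, meaning about a third of the hardware's theoretical throughput is doing useful model math.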
§ 03
Architecture interpretation
Layered view of the platform stack as described in public engineering materials. Read bottom-up — hardware to business use case.
Layer 07
Business use cases
Eats recommendations · Search · Support chatbots · Code development · SQL generation
Layer 06
Model layer
Llama 2 / 3 · Mixtral · Closed-source providers · Fine-tuned internal models
Layer 05
Training stack
PyTorch · Ray Train · Hugging Face Transformers · DeepSpeed · NCCL
Layer 04
Orchestration
Kubernetes · Ray · KubeRay · Federation Layer
Layer 03
Hardware
NVIDIA A100 (on-prem) · NVIDIA H100 (Google Cloud) · Crane infra stack
Layer 02
Evaluation / scoring
Offline LLM scorer · vLLM · Batch prediction · MFU benchmarks
Layer 01
Production decisions
Resource planning · Model selection · Scalability · Cost-performance tradeoffs
§ 04
Open source vs in-house
Two columns. The strategic value lives in the conversion between them.
Open source provides
- Speed — frontier techniques in days, not quarters
- Reusable model ecosystems (Llama, Mixtral, Falcon)
- Community-tested frameworks (Ray, DeepSpeed, vLLM)
- Lower experimentation friction
- Faster access to fine-tuning recipes (PEFT, LoRA, QLoRA)
In-house platform provides
- Production integration at Uber-scale traffic
- Multi-cluster federation and resource management
- Governance, security, compliance posture
- Reliability — 99.99% uptime targets
- Domain-specific tuning (Eats menu, search semantics)
- Internal business alignment
The strategic value is not open source alone. The value is converting open-source capability into reliable internal platform leverage.
§ 05
Source article
Primary source
Open Source and In-House: How Uber Optimizes LLM Training
Uber Engineering Blog · April 14, 2026
Relevant points
- Llama 3 405B is on the near-term roadmap for fine-tuning support.
- Throughput optimization through DeepSpeed ZeRO-stage-3 CPU offload yielded 34% GPU memory reduction.
- Flash attention saves 50% GPU memory at the same batch size.
- mactl is the named CLI tool. Canvas + Notebooks complete the developer entry surface.
- Comet is the experiment tracking server. Terrablob is the blob storage layer. HDFS is data ingest.
- The conclusion explicitly states open source as ecosystem strategy: embracing open source is the key to catching up with generative AI trends.