How Uber Optimizes LLM Training · April 2026

Michelangelo OSS
Business Use Case

Meet Michelangelo. Uber's Machine Learning Platform.

By Jade Melody Walker

AI Infrastructure · ML Platform · Open Source

Independent enterprise use case based on public engineering materials. Not affiliated with Uber.

Demo video

Walkthrough · ~4 min

Your first ML Michelangelo OSS pull request: pipeline reruns and auto-flip

[ video file goes here ]

FILE: ma-oss-1075-walkthrough.mp4 · UBER · MA OSS

§ 06

Issue glossary

22 real open issues from michelangelo-ai/michelangelo, sorted into 5 buckets. Dense engineering language translated into practitioner-readable framing.

Bucket 01 / 05

Pipeline & Trigger Lifecycle

How jobs are scheduled, versioned, retried, and cleaned up

#	Issue	What it means
#1075	Support trigger batch rerun and auto flip due to pipeline revision [good first issue, missing-feature]	When you update your pipeline, scheduled triggers don't pick up the new version automatically. You have to manually rewire them one by one. This adds the runtime logic that should have shipped with the feature.
#1091	Support Pipeline Cascade Delete [missing-feature]	Deleting a pipeline doesn't clean up the jobs and resources attached to it. You delete the parent; the children stay orphaned. Cascade delete fixes the cleanup gap.
#1055	Support trigger run creation and more detailed trigger run display page [missing-feature]	You can set up triggers but you can't see meaningful detail about what ran, when, or why it failed. This adds both the creation surface and the observability page.
#793	Support pipeline run retry with immutable pipeline run state	If a pipeline run fails, there's no retry mechanism — you'd recreate from scratch. This adds retry while keeping the original run record intact so your history stays clean.
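
The idea behind #793 can be pictured as an append-only run history: a retry creates a new record that points back at the failed one, and the failed record is never edited. The sketch below is a minimal illustration under that assumption; `RunRecord`, `retry`, and the field names are hypothetical and not taken from the Michelangelo codebase.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)  # frozen: a stored run record cannot be mutated
class RunRecord:
    run_id: str
    status: str
    attempt: int = 1
    retry_of: Optional[str] = None  # id of the run this one retries, if any


def retry(history: Tuple[RunRecord, ...], run_id: str) -> Tuple[RunRecord, ...]:
    """Return a new history with a retry appended; the original record is untouched."""
    original = next(r for r in history if r.run_id == run_id)
    if original.status != "failed":
        raise ValueError(f"{run_id} is {original.status}; only failed runs are retried")
    new_run = RunRecord(
        run_id=f"{run_id}-retry{original.attempt}",
        status="pending",
        attempt=original.attempt + 1,
        retry_of=run_id,
    )
    return history + (new_run,)


history = (RunRecord("run-42", "failed"),)
updated = retry(history, "run-42")
print(updated[-1])
```

Because the history is a tuple of frozen records, the failed run stays visible for audit while the retry carries its own status.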

Bucket 02 / 05

CLI & Developer Tooling (mactl)

The developer entry surface — commands, config, scaffolding

#	Issue	What it means
#946	Add root-level Makefile to simplify developer setup and build commands	Getting the project running locally requires knowing which commands to run in which order. A Makefile turns that into one command. The difference between "clone and figure it out" and "clone and contribute."
#937	Auto-generate mactl config templates from CRD spec	Config templates for the CLI are hand-written and can drift from what the schema supports. Auto-generating from the CRD spec keeps templates accurate without manual maintenance.
#936	Auto-generate CLI arguments from CRD OpenAPI schema	CLI flags get out of sync with the API schema over time. Generating them from the schema source keeps the developer surface honest — same problem as above, one layer up.
#934	MaCTL support for `worker_queue` annotation on Project CRD	The `worker_queue` annotation exists in the platform spec but the CLI doesn't expose it. You can set it in YAML but can't interact with it through `mactl`. This closes the gap.
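
One way to picture #936 is a generator that walks a schema's properties and emits one CLI flag per field, so the flags cannot drift from the schema. The sketch below is a minimal illustration using Python's `argparse` and a simplified schema dict. Only `worker_queue` is named in the issues above; the other fields, the `example-cli` program name, and `build_parser` are hypothetical.

```python
import argparse

# Hypothetical, simplified slice of a CRD's OpenAPI v3 schema.
PROJECT_SCHEMA = {
    "properties": {
        "worker_queue": {"type": "string", "description": "Queue that runs the project's jobs"},
        "max_retries": {"type": "integer", "description": "Retry budget per run"},
        "dry_run": {"type": "boolean", "description": "Validate without applying"},
    },
    "required": ["worker_queue"],
}

# Map OpenAPI scalar types to Python converters; unknown types fall back to str.
_TYPES = {"string": str, "integer": int, "number": float}


def build_parser(schema: dict, prog: str = "example-cli") -> argparse.ArgumentParser:
    """Create one --flag per schema property."""
    parser = argparse.ArgumentParser(prog=prog)
    required = set(schema.get("required", []))
    for name, spec in schema["properties"].items():
        flag = "--" + name.replace("_", "-")
        help_text = spec.get("description", "")
        if spec.get("type") == "boolean":
            # Booleans become presence flags that default to False.
            parser.add_argument(flag, action="store_true", help=help_text)
        else:
            parser.add_argument(
                flag,
                type=_TYPES.get(spec.get("type"), str),
                required=name in required,
                help=help_text,
            )
    return parser


args = build_parser(PROJECT_SCHEMA).parse_args(
    ["--worker-queue", "gpu-a100", "--max-retries", "3"]
)
print(args.worker_queue, args.max_retries, args.dry_run)
```

Adding a field to the schema then adds the flag with no hand-written CLI change, which is the maintenance saving the issue is after.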

Bucket 03 / 05

Platform Schema (K8s / CRD / Spec)

How the platform's resource definitions align to Kubernetes standards

#	Issue	What it means
#932	Replace `hdfs_delegation_token` plain string with K8s Secret reference	Credentials for data access are stored as plain strings in config. This moves them to K8s Secrets — how every security-conscious platform handles sensitive values.
#931	Replace `ResourceSpec` custom types with K8s `resource.Quantity`	The platform defined its own resource types instead of using the K8s standard. This creates friction for anyone who already knows Kubernetes. This issue aligns the schema to the ecosystem standard.
#929	Add priority information to `JobPrioritySpec`	There's a spec object for job priority but it doesn't carry priority data yet. You can reference it but it doesn't do anything. This adds the substance behind the structure.
#928	Add inter-job and anti-affinity to `JobSchedulingSpec`	You can't tell the platform "don't run these two jobs on the same node." Standard K8s scheduling controls, not yet exposed at the job spec level.

Bucket 04 / 05

Documentation & Public Readiness

What blocks external contributors from day one

#	Issue	What it means
#822	Documentation Website Polish for Open-Source Launch [documentation, public-readiness]	The docs site exists but isn't ready for external eyes: navigation, structure, and presentation still reflect an internal-first audience. This is the public launch polish pass.
#771	Document: Single Docker container for developer onboarding [public-readiness]	There's no single-container setup for someone who wants to run Michelangelo locally and start contributing. This is the "zero to running" gap that blocks every new external contributor.
#770	Document: YAML Configuration — Key concepts & terms [public-readiness]	The YAML config is how practitioners interact with the platform daily. No glossary. No term definitions. Engineers have to reverse-engineer what each field does from the schema.
#1057	docs: ingester-design.md incorrectly claims ingester watches 13 CRDs (0 since PR #1012)	The architecture docs describe behavior that a merged PR already changed. External contributors are building a mental model of a system that no longer exists.

Bucket 05 / 05

UI & Workflow Visibility

What practitioners can see and act on inside the platform

#	Issue	What it means
#1094	UI not display workflow DAG as defined in python code	You define your pipeline in Python and the visual interface is supposed to reflect it. It doesn't. The DAG view and the code are out of sync — breaking the primary feedback loop for pipeline authors.
#1088	Restructure test files into nested describes [good first issue, refactoring]	Test files are flat with no logical grouping by feature or behavior. When a test fails, it's hard to tell what it's testing. Nested describes make the test suite readable and contributor-friendly.
#927	Project phase transitions have no side effects	Moving a project between lifecycle phases (draft → active → archived) should trigger downstream behaviors. Currently the state changes but nothing else responds.
#836	Failure to connect temporal requires TLS	The workflow orchestration layer fails to connect when TLS is required — a standard configuration in any production or security-conscious environment. Blocks deployment in hardened infra.
#442	Add loading state to Execution component and integrate with Detail page rendering [bug, good first issue]	The Execution component renders immediately with no loading state. While data is fetching, the UI shows nothing or stale content. Adds the standard loading pattern so the page feels intentional, not broken.
#349	Update detail view header to display truncated titles in a tooltip [enhancement, good first issue]	Long pipeline or job names get cut off in the header with no way to see the full title. Adds a tooltip so engineers can always read the full name without navigating away.

Issue

#1075 walkthrough

Support trigger batch rerun and auto flip due to pipeline revision

good first issue missing-feature

repo: github.com/michelangelo-ai/michelangelo
docs: michelangelo-ai.github.io/michelangelo
issue: github.com/michelangelo-ai/michelangelo/issues/1075

The problem

Trigger runs with failed pipeline runs require manual re-triggering one by one. Triggers also pin to a specific pipeline revision — when a new revision registers, the trigger does not auto-flip. The auto_flip field is already in the YAML schema and protobuf spec. The runtime logic is missing.
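
The two missing behaviors can be sketched in a few lines. This is a minimal illustration, not the project's runtime: the `auto_flip` field name comes from the issue, while `Trigger`, `PipelineRun`, `on_revision_registered`, `batch_rerun`, and the choice to rerun on the trigger's current revision are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Trigger:
    name: str
    pipeline: str
    revision: int            # pipeline revision the trigger is pinned to
    auto_flip: bool = False  # field named in the issue; semantics assumed here


@dataclass
class PipelineRun:
    run_id: str
    trigger: str
    revision: int
    status: str              # "succeeded", "failed", or "pending"


def on_revision_registered(triggers, pipeline, new_revision):
    """Move opted-in triggers to the new revision; return the names that moved."""
    flipped = []
    for t in triggers:
        if t.pipeline == pipeline and t.auto_flip and t.revision < new_revision:
            t.revision = new_revision
            flipped.append(t.name)
    return flipped


def batch_rerun(runs, triggers_by_name):
    """Create one new run per failed run instead of re-triggering one by one."""
    reruns = []
    for r in runs:
        if r.status == "failed":
            # Assumption: a rerun uses whatever revision the trigger points at now.
            rev = triggers_by_name[r.trigger].revision
            reruns.append(PipelineRun(f"{r.run_id}-rerun", r.trigger, rev, "pending"))
    return reruns


triggers = [
    Trigger("nightly", "eta-model", 3, auto_flip=True),
    Trigger("weekly", "eta-model", 3),
]
flipped = on_revision_registered(triggers, "eta-model", 4)

runs = [
    PipelineRun("r1", "nightly", 3, "failed"),
    PipelineRun("r2", "weekly", 3, "succeeded"),
    PipelineRun("r3", "weekly", 3, "failed"),
]
reruns = batch_rerun(runs, {t.name: t for t in triggers})
print(flipped, [(r.run_id, r.revision) for r in reruns])
```

In this toy data, only the trigger that opted in moves to revision 4, and both failed runs are re-queued in one call.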

The translation

Here is how I translate this from my last project, in the world of production enterprise batch processing. A scheduled job fails. It holds up the rest of the batch. Operations gets the call. The failed step gets skipped so the rest of the batch keeps moving, and the failure gets inspected the next morning. Usually the cause is a vendor file that did not arrive on time. Auto-flip is the same operator instinct: when a new revision registers upstream, the system picks it up without a human recreating the trigger. The vocabulary changes across ecosystems. The problem does not.

Contributor path

01 Read the issue body: the solution shape is already scoped into two parts.

02 Comment on the issue to scope the work and ask whether to ship batch rerun and auto-flip as one PR or two.

03 Fork the repo, branch, and locate the trigger runtime in the codebase.

04 Activate the dormant auto_flip runtime logic; add the batch rerun CLI surface (ma trigger_run rerun).

05 Write tests, open a PR linking back to #1075 with "Closes #1075".

06 Iterate on maintainer review. Merge ships in the next release.

If you take this issue on, tag me. I want to see this PR closed.

Supporting work · Public repos

Token-estimator → repo

Pre-deployment cost planning

CLI that gives ML engineers token count and cost transparency before a single API call ships to production.

ContextEval → repo

RAG grounding enforcement

Returns answers only when information exists in the defined corpus. Closes the hallucination gap in production RAG pipelines.

Trace-bench → repo

Output stability + regression

Benchmarks model behavior across versions to surface silent drift and regression in ML serving layers.

Technical storytelling · ContextEval

ContextEval · RAG grounding enforcement pipeline

How production RAG pipelines close the hallucination gap

Input

User Query

Natural language question submitted to the pipeline

↓

Step 01 — Retrieval

Corpus Search

FAISS index · Chunked document store · Similarity ranking

↓

Step 02 — Gate

Confidence Threshold Check

Retrieval score meets threshold → pass · Below threshold → refuse

↓

Pass

Grounded answer

Response sourced from corpus only

Fail

"Not in corpus"

Refuses to answer. No hallucination.

The strategic value: ContextEval applies the same production discipline to RAG pipelines that batch systems apply to data contracts — if the record doesn't exist in the source, the system doesn't invent one.
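
The gate in Step 02 can be sketched as follows. This is a minimal illustration, not ContextEval's code: it swaps the FAISS index for a brute-force cosine search over toy vectors, and the corpus text, the `0.8` threshold, and the function names are assumptions made for the example.

```python
import math

REFUSAL = "Not in corpus"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, corpus):
    """Return (best_score, best_chunk) by brute-force similarity ranking."""
    return max((cosine(query_vec, vec), text) for text, vec in corpus)


def answer(query_vec, corpus, threshold=0.8):
    """Pass the best chunk through only if its score clears the threshold."""
    score, chunk = retrieve(query_vec, corpus)
    if score < threshold:
        return REFUSAL  # below threshold: refuse rather than guess
    return chunk


# Toy corpus: (chunk text, embedding). Real embeddings come from a model.
corpus = [
    ("Refunds are issued within 5 business days.", [1.0, 0.0, 0.1]),
    ("Drivers are paid weekly.", [0.0, 1.0, 0.1]),
]
print(answer([0.9, 0.1, 0.1], corpus))  # close to the first chunk
print(answer([0.1, 0.1, 1.0], corpus))  # close to nothing in the corpus
```

The first query clears the threshold and returns the matching chunk; the second scores low against every chunk and gets the refusal string. A production pipeline would hand the passing chunk to a generator constrained to it, which this sketch leaves out.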

Platform strategy · Visual thesis

Layer A — Supply

Open-source ecosystem

Llama 2 / 3 · Mixtral · Ray · PyTorch · DeepSpeed · Hugging Face · vLLM · LoRA / QLoRA

↓

Layer B — Conversion

Michelangelo platform layer

training · orchestration · scoring · resource scheduling · model lifecycle · federation

↓

Layer C — Application

Uber-scale production systems

search · recommendations · support chatbots · code generation · SQL generation

↓

Layer D — Outcomes

Business outcomes

speed · scale · cost control · reliability · production readiness

§ 01

Executive summary

01 Michelangelo functions as Uber's internal AI/ML platform layer, production-tested at scale across classic ML, deep learning, LLMs, and now agents.

02 Open source accelerates model training, experimentation, and infrastructure development. The platform is built on Ray, Kubernetes, PyTorch, DeepSpeed, Hugging Face, and vLLM.

03 In-house systems convert open-source components into production-ready platform capability. The strategic value is the conversion, not the components themselves.

04 The DevRel opportunity is likely about translating this platform story into practitioner trust, adoption, and clear technical education for external ML platform engineers.

§ 02

Corporate objective mapping

A read of how Michelangelo functions might map to Uber's stated technology objectives.

Objective	MA Function	OSS Angle
Leading technology / platform capability	Centralized internal ML platform — training, evaluation, scaling	Open-source frameworks accelerate frontier-model adoption
Developer velocity	CLI (mactl), Canvas, notebooks, JobSpec abstraction	Hugging Face, Ray Train reduce engineering effort
Production readiness	Federation across multi-cluster K8s, JobSpec lifecycle	Kubernetes + Ray + KubeRay as the orchestration backbone
Cost and throughput optimization	DeepSpeed ZeRO, flash attention, batch tuning, MFU benchmarks	Industry-standard MFU using OSS optimizers
Trust, reliability, evaluation	Offline LLM scorer, vLLM-based batch prediction, throughput benchmarks	Ray ActorPool + vLLM as the evaluation substrate
Open-source ecosystem credibility	MA OSS public repo, "good first issue" labels, contributor onboarding	External adoption signals platform durability beyond Uber
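
MFU (model FLOPs utilization) appears in the table as the cost and throughput measure. A common back-of-envelope form divides achieved training FLOPs by the hardware peak, using the approximation of about 6 FLOPs per parameter per token for a forward and backward pass. The sketch below assumes that approximation, which ignores attention FLOPs. The model size and throughput are made-up illustration values, not figures from the source article; 312 TFLOPS is the published dense BF16 peak for an A100.

```python
def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved training FLOPs/s over hardware peak FLOPs/s."""
    achieved = 6 * params * tokens_per_second  # ~6 FLOPs per parameter per token
    return achieved / (num_gpus * peak_flops_per_gpu)


# Illustrative numbers only: a 7B-parameter model at 20,000 tokens/s on 8 A100s.
example = mfu(params=7e9, tokens_per_second=20_000, num_gpus=8,
              peak_flops_per_gpu=312e12)
print(f"{example:.1%}")
```

Because the numerator depends only on model size and throughput, the same function lets two hardware setups be compared on one scale.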

§ 03

Architecture interpretation

Layered view of the platform stack as described in public engineering materials. Read bottom-up — hardware to business use case.

Layer 07

Business use cases

Eats recommendations · Search · Support chatbots · Code development · SQL generation

Layer 06

Model layer

Llama 2 / 3 · Mixtral · Closed-source providers · Fine-tuned internal models

Layer 05

Training stack

PyTorch · Ray Train · Hugging Face Transformers · DeepSpeed · NCCL

Layer 04

Orchestration

Kubernetes · Ray · KubeRay · Federation Layer

Layer 03

Hardware

NVIDIA A100 (on-prem) · NVIDIA H100 (Google Cloud) · Crane infra stack

Layer 02

Evaluation / scoring

Offline LLM scorer · vLLM · Batch prediction · MFU benchmarks

Layer 01

Production decisions

Resource planning · Model selection · Scalability · Cost-performance tradeoffs

§ 04

Open source vs in-house

Two columns. The strategic value lives in the conversion between them.

Open source provides

Speed — frontier techniques in days, not quarters
Reusable model ecosystems (Llama, Mixtral, Falcon)
Community-tested frameworks (Ray, DeepSpeed, vLLM)
Lower experimentation friction
Faster access to fine-tuning recipes (PEFT, LoRA, QLoRA)

In-house platform provides

Production integration at Uber-scale traffic
Multi-cluster federation and resource management
Governance, security, compliance posture
Reliability — 99.99% uptime targets
Domain-specific tuning (Eats menu, search semantics)
Internal business alignment

The strategic value is not open source alone. The value is converting open-source capability into reliable internal platform leverage.

§ 05

Source article

Primary source

Open Source and In-House: How Uber Optimizes LLM Training

Bo Ling, Jiapei Huang, Baojun Liu, Chongxiao Cao, Anant Vyas, Peng Zhang

Uber Engineering Blog · April 14, 2026

Relevant points

Llama 3 405B is on the near-term roadmap for fine-tuning support.
Throughput optimization through DeepSpeed ZeRO-stage-3 CPU offload yielded 34% GPU memory reduction.
Flash attention saves 50% GPU memory at the same batch size.
mactl is the named CLI tool. Canvas + Notebooks complete the developer entry surface.
Comet is the experiment tracking server. Terrablob is the blob storage layer. HDFS is data ingest.
The conclusion explicitly states open source as ecosystem strategy: embracing open source is the key to catching up with generative AI trends.
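
For readers who have not used the ZeRO stage 3 CPU offload mentioned above, the sketch below shows what such a DeepSpeed configuration typically looks like. It is a minimal illustration, not Uber's configuration: the keys are standard DeepSpeed config keys, while the batch size and accumulation values are placeholder assumptions.

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 3 with optimizer state and
# parameters offloaded to CPU memory. Numeric values are placeholders.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

print(json.dumps(ds_config, indent=2))
```

A dict like this can be written to a JSON file or passed to Hugging Face `TrainingArguments` through its `deepspeed` argument. Offload trades GPU memory for host-to-device transfer time, so the memory saving usually comes with a throughput cost that needs measuring.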

