Fixing GPU Starvation in Large-Scale Distributed Training
This episode of MLOps.community examines GPU starvation in large-scale distributed training through war stories from Kashish, a Staff Engineer at Uber who leads the marketplace matching team. In his account, the core issue is not model complexity or hardware limits but data I/O bottlenecks that leave GPUs waiting for input. Kashish recounts a case where training utilization was stuck at 15–20% on A100 GPUs despite a seemingly sound model. Loading the data entirely into RAM raised utilization to 85%, which pointed away from the model and toward the input pipeline. The team then traced the problem to the overhead of the PyArrow-to-NumPy conversion performed inside the data pipeline. By caching the converted NumPy data instead of reprocessing it, they again reached 85% utilization and cut training time from a day to a few hours.
The episode also covers the trade-off between efficiency and reproducibility in parallel data loading, the importance of warm-starting models for inference, and the growing role of AI agents in debugging and workflow automation. The narrative unfolds like a detective story and argues that the most impactful optimizations often lie in data engineering rather than model architecture.
GPU utilization issues are rarely due to model complexity—data I/O bottlenecks are the primary culprit.
Caching transformed data (e.g., NumPy arrays) instead of reprocessing PyArrow on every epoch can dramatically improve GPU utilization.
Even with high parallelism, model quality can degrade without deterministic data ordering—solved via per-worker queues and pre-allocated work distribution.
Warm-starting inference models with synthetic traffic prevents cold-start latency and ensures consistent performance.
AI agents can accelerate development, but only when paired with structured workflows, clear context (e.g., TL;DR comments), and deliberate reasoning steps.
The GPU Starvation Crisis: When Data Feeds Are the Bottleneck
“The model is there and that's why you're saying you're starving the GPU. It's there but you can't feed the data that that model needs.”
The Detective Work: Uncovering the PyArrow Bottleneck
“This PyArrow to NumPy translation is done on the fly when we read the data from the queue. And when we were doing everything in memory, everything was NumPy. So this small thing, which is the translation between the different data types. I mean, if the GPU understood [PyArrow], sure, that would have been the best thing, but it doesn't.”
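One way to make this kind of bottleneck visible is to time how long the training loop waits for each batch versus how long it spends computing on it. The sketch below is illustrative only and is not the team's tooling; `time_input_vs_compute` and its parameters are hypothetical names.

```python
import time
from typing import Callable, Iterable, Tuple


def time_input_vs_compute(
    batches: Iterable[object],
    train_step: Callable[[object], object],
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in train_step).

    Assumption: train_step blocks until the work is finished. With an
    asynchronous GPU framework you would need to synchronize the device
    inside train_step, otherwise compute time is under-counted.
    """
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # includes any on-the-fly decoding or conversion
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s
```

If the wait time dominates, the accelerator is being starved by the input pipeline, and the next step is to find which stage of that pipeline (reading, decoding, type conversion) accounts for the delay.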
Solving the Double Bottleneck: Caching Transformed Data
“We can just cache the transformed output. You don't need to do transformation every time, right? Just cache. The cache can always have NumPy.”
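The idea in the quote can be sketched as a small on-disk cache that stores the already-converted NumPy array, so the expensive conversion runs once per chunk rather than on every read. This is a minimal sketch under stated assumptions, not the pipeline described in the episode: `load_cached`, `key`, `cache_dir`, and `decode_fn` are hypothetical names, and `decode_fn` stands in for whatever performs the PyArrow-to-NumPy conversion.

```python
import os
from typing import Callable

import numpy as np


def load_cached(
    key: str,
    cache_dir: str,
    decode_fn: Callable[[], np.ndarray],
) -> np.ndarray:
    """Return the NumPy array for `key`, converting it at most once.

    decode_fn is a hypothetical callable that does the expensive
    conversion (for example, from a PyArrow structure to NumPy).
    """
    path = os.path.join(cache_dir, f"{key}.npy")
    if os.path.exists(path):
        # Cache hit: read the stored array, no conversion needed.
        return np.load(path)

    arr = np.asarray(decode_fn())

    # Write to a temporary file first, then rename, so a concurrent
    # reader never sees a partially written cache entry.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        np.save(f, arr)
    os.replace(tmp_path, path)
    return arr
```

The cache key would need to change whenever the source data or the transformation changes; how keys are derived is left open here because the episode summary does not describe it.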
Reproducibility vs. Efficiency: The Hidden Trade-Off
Increased parallelism boosted efficiency but caused model quality degradation due to non-deterministic data ordering. The team solved this by introducing per-worker queues with pre-allocated, deterministic work distribution, ensuring reproducible results without sacrificing performance.
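A rough illustration of pre-allocated, deterministic work distribution follows. It assumes the work is split into named shards and that a fixed seed plus the epoch number decides the order; the function names and the round-robin scheme are assumptions made for this sketch, not details confirmed by the episode.

```python
import random
from collections import deque
from typing import Deque, List, Sequence


def assign_work(
    shards: Sequence[str],
    num_workers: int,
    epoch: int,
    seed: int = 0,
) -> List[Deque[str]]:
    """Pre-allocate shards to per-worker queues in a reproducible order."""
    order = list(shards)
    # A seeded RNG makes the shuffle identical across runs for a given
    # (seed, epoch) pair, while still varying between epochs.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    queues: List[Deque[str]] = [deque() for _ in range(num_workers)]
    for i, shard in enumerate(order):
        queues[i % num_workers].append(shard)
    return queues


def merged_order(queues: Sequence[Deque[str]]) -> List[str]:
    """Consume the queues round-robin in fixed worker order.

    Because the consumer always takes from worker 0, then 1, and so on,
    the global order does not depend on which worker finishes first.
    """
    pending = [deque(q) for q in queues]  # copy so inputs are not mutated
    out: List[str] = []
    while any(pending):
        for q in pending:
            if q:
                out.append(q.popleft())
    return out
```

The contrast is with a single shared queue, where whichever worker happens to be fastest determines the order in which samples reach the model, so two runs with the same seed can still see different data orderings.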
Serving Challenges: Latency, Efficiency, and the Cost of Fresh Data
The discussion shifts to serving, where trade-offs between latency and efficiency are acute. Waiting for full batches increases latency, while padding wastes GPU capacity. Real-time features can't be cached, making data retrieval costly and blocking inference.
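The latency versus efficiency trade-off described above is often handled by collecting requests until either the batch is full or a deadline passes. The sketch below shows that general pattern using only the standard library; it is not taken from the episode, and `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names.

```python
import queue
import time
from typing import Any, List


def collect_batch(
    requests: "queue.Queue[Any]",
    max_batch_size: int,
    max_wait_s: float,
) -> List[Any]:
    """Gather up to max_batch_size requests, waiting at most max_wait_s
    after the first request arrives."""
    # Block until there is at least one request to serve.
    batch = [requests.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline reached: serve a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch
```

A larger `max_wait_s` fills batches more often and uses the GPU more efficiently, at the cost of added latency for the first request in each batch. A smaller value bounds latency but produces partial batches, which either run under capacity or need padding.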
Mentioned in this episode
Kashish (person)
Uber (organization)
Petastorm (product)
PyArrow (product)
NumPy (product)
A100 (other)
Rob (person)
YouTube Ads (product)
TPU (other)
