Fixing GPU Starvation in Large-Scale Distributed Training

MLOps.community | 52 min | April 3, 2026

AI-Generated Summary

This episode of MLOps.community examines GPU starvation in large-scale distributed training through real-world war stories from Kashish, a Staff Engineer at Uber who leads the marketplace matching team. The core issue is neither model complexity nor hardware limits, but data I/O bottlenecks that leave GPUs waiting for input. Kashish recounts a pivotal case where training utilization was stuck at 15–20% on A100 GPUs despite a seemingly sound model. As a diagnostic step, the team loaded the data entirely into RAM, which raised utilization to 85% and pointed at the data pipeline rather than the model. The real culprit turned out to be the overhead of converting PyArrow data to NumPy on the fly as data was read from the queue. By caching the transformed NumPy data instead of reprocessing it, they achieved 85% utilization and reduced training time from a day to just a few hours.

The episode also covers the trade-off between efficiency and reproducibility in parallel data loading, the importance of warm-starting models for inference, and the growing role of AI agents in debugging and workflow automation. The narrative unfolds like a detective story and makes the point that the most impactful optimizations often lie in data engineering, not model architecture.

Key Takeaways
1. GPU utilization issues are rarely due to model complexity; data I/O bottlenecks are the primary culprit.
2. Caching transformed data (e.g., NumPy arrays) instead of reprocessing PyArrow on every epoch can dramatically improve GPU utilization.
3. Even with high parallelism, model quality can degrade without deterministic data ordering. The team solved this with per-worker queues and pre-allocated work distribution.
4. Warm-starting inference models with synthetic traffic prevents cold-start latency and ensures consistent performance.
5. AI agents can accelerate development, but only when paired with structured workflows, clear context (e.g., TL;DR comments), and deliberate reasoning steps.
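As an illustration of takeaway 4, here is a minimal sketch of warming up a model with synthetic traffic before it receives real requests. `LazyModel`, `warm_up`, and `make_synthetic_request` are hypothetical names made up for this example, and the lazy initialization stands in for whatever one-time cost a real serving stack pays on its first calls. The episode does not describe the actual implementation.

```python
import time


class LazyModel:
    """Hypothetical model whose first call pays a one-time setup cost."""

    def __init__(self):
        self.initialized = False

    def predict(self, features):
        if not self.initialized:
            # Stand-in for weight loading, kernel compilation, cache fills, etc.
            time.sleep(0.01)
            self.initialized = True
        return sum(features)


def make_synthetic_request():
    """Hypothetical generator of a request shaped like real traffic."""
    return [0.0, 1.0, 2.0]


def warm_up(predict, make_request, num_requests=5):
    """Send synthetic requests and return the latency of each one in seconds."""
    latencies = []
    for _ in range(num_requests):
        start = time.perf_counter()
        predict(make_request())
        latencies.append(time.perf_counter() - start)
    return latencies


model = LazyModel()
latencies = warm_up(model.predict, make_synthetic_request)
ready = model.initialized  # only route real traffic once this is True
```

The design choice is to pay the cold-start cost on throwaway requests, so the first real user request sees the same latency as later ones.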

Chapters
0:00
10 min

The GPU Starvation Crisis: When Data Feeds Are the Bottleneck

The model is there and that's why you're saying you're starving the GPU. It's there but you can't feed the data that that model needs.
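One way to check whether a training loop is starved for data is to time how long each step waits on the loader versus how long it spends computing. The sketch below is an assumption-laden illustration, not the team's tooling: `profile_loop`, `slow_loader`, and `train_step` are hypothetical names, and the sleep simulates a slow input pipeline.

```python
import time


def profile_loop(batches, train_step):
    """Return the fraction of wall time spent waiting for the next batch."""
    wait = 0.0
    compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # time spent here is compute
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    total = wait + compute
    return wait / total if total else 0.0


def slow_loader(num_batches):
    """Hypothetical loader that simulates I/O or conversion overhead."""
    for _ in range(num_batches):
        time.sleep(0.02)
        yield [1.0, 2.0, 3.0]


def train_step(batch):
    """Hypothetical stand-in for a forward and backward pass."""
    return sum(batch)


wait_fraction = profile_loop(slow_loader(5), train_step)
```

A wait fraction close to 1.0 suggests the accelerator is idle most of the time and the input pipeline, not the model, is the place to look.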

10:00
10 min

The Detective Work: Uncovering the PyArrow Bottleneck

This PyArrow to NumPy translation is done on the fly when we read the data from the queue. And when we were doing everything in memory, everything was NumPy. So this small thing, which is the translation between the different data types, because, I mean, if the GPU understood PyArrow, sure, that would have been the best thing, but it doesn't.

20:00
10 min

Solving the Double Bottleneck: Caching Transformed Data

We can just cache the transformed output. You don't need to do transformation every time, right? Just cache. The cache can always have NumPy.
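The caching idea in this quote can be sketched as follows. This is a minimal illustration under stated assumptions: `load_cached` and `fake_transform` are hypothetical names, the cache is a directory of `.npy` files, and the stand-in transform takes the place of the real PyArrow-to-NumPy conversion, which the episode does not show.

```python
import os
import tempfile
from typing import Callable

import numpy as np


def load_cached(shard_id: str, cache_dir: str,
                transform: Callable[[str], np.ndarray]) -> np.ndarray:
    """Return the transformed array for a shard, converting at most once."""
    path = os.path.join(cache_dir, f"{shard_id}.npy")
    if os.path.exists(path):
        # Cache hit: read the NumPy array directly and skip the conversion.
        return np.load(path)
    array = transform(shard_id)  # expensive step, e.g. Arrow -> NumPy
    np.save(path, array)         # persist so later epochs can reuse it
    return array


calls = []


def fake_transform(shard_id: str) -> np.ndarray:
    """Hypothetical stand-in for the conversion; records how often it runs."""
    calls.append(shard_id)
    return np.arange(4, dtype=np.float32)


with tempfile.TemporaryDirectory() as cache_dir:
    for epoch in range(3):
        batch = load_cached("shard-0", cache_dir, fake_transform)
```

Across three simulated epochs the conversion runs only on the first read. Later reads load the stored NumPy array, which is the effect the quote describes.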

30:00
10 min

Reproducibility vs. Efficiency: The Hidden Trade-Off

Increased parallelism boosted efficiency but caused model quality degradation due to non-deterministic data ordering. The team solved this by introducing per-worker queues with pre-allocated, deterministic work distribution, ensuring reproducible results without sacrificing performance.
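One possible reading of "per-worker queues with pre-allocated, deterministic work distribution" is sketched below. It is not the team's implementation. `assign_shards` and `run_epoch` are hypothetical names, shards are plain integers, and the worker body only forwards the shard where real code would load and decode it.

```python
import queue
import random
import threading


def assign_shards(shards, num_workers, seed, epoch):
    """Pre-allocate shards to workers; the plan depends only on (seed, epoch)."""
    order = list(shards)
    random.Random(seed * 100003 + epoch).shuffle(order)
    # Worker w gets every num_workers-th shard, starting at position w.
    return [order[w::num_workers] for w in range(num_workers)]


def run_epoch(shards, num_workers, seed, epoch):
    """Produce shards in parallel but consume them in a fixed order."""
    plan = assign_shards(shards, num_workers, seed, epoch)
    queues = [queue.Queue() for _ in range(num_workers)]

    def worker(w):
        for shard in plan[w]:
            queues[w].put(shard)  # real code would load and decode here
        queues[w].put(None)       # sentinel: this worker is finished

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()

    # Read the queues round-robin, so the output order is independent of
    # how fast each worker happens to run.
    consumed = []
    active = list(range(num_workers))
    while active:
        for w in list(active):
            item = queues[w].get()
            if item is None:
                active.remove(w)
            else:
                consumed.append(item)
    for t in threads:
        t.join()
    return consumed


first = run_epoch(range(10), num_workers=3, seed=7, epoch=0)
second = run_epoch(range(10), num_workers=3, seed=7, epoch=0)
```

With a single shared queue, the consumption order would depend on thread timing. Here each worker owns a queue and a fixed slice of the plan, so two runs with the same seed and epoch see the same data order.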

40:00
10 min

Serving Challenges: Latency, Efficiency, and the Cost of Fresh Data

The discussion shifts to serving, where trade-offs between latency and efficiency are acute. Waiting for full batches increases latency, while padding wastes GPU capacity. Real-time features can't be cached, making data retrieval costly and blocking inference.
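The batching trade-off described here is often handled by waiting for a full batch only up to a deadline and padding whatever arrived. The sketch below illustrates that general pattern under assumptions: `collect_batch` and `pad_batch` are hypothetical names, and the episode does not say which policy the team uses.

```python
import queue
import time


def collect_batch(requests, max_batch, max_wait_s):
    """Gather up to max_batch requests, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # deadline reached: serve a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


def pad_batch(batch, max_batch, pad_value):
    """Pad to a fixed size and return a mask marking the real entries."""
    missing = max_batch - len(batch)
    mask = [True] * len(batch) + [False] * missing
    return batch + [pad_value] * missing, mask


pending = queue.Queue()
for request in ("a", "b", "c"):
    pending.put(request)

batch = collect_batch(pending, max_batch=4, max_wait_s=0.05)
padded, mask = pad_batch(batch, max_batch=4, pad_value=None)
```

A longer `max_wait_s` fills more batches and wastes fewer padded slots, at the cost of added latency for the requests that arrived first. A shorter one does the opposite.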

High-Impact Quotes
This PyArrow to NumPy translation is done on the fly when we read the data from the queue. And when we were doing everything in memory, everything was NumPy. So this small thing, which is the translation between the different data types, because, I mean, if the GPU understood PyArrow, sure, that would have been the best thing, but it doesn't.
Kashish, 19:10 (Viral: 90.0)

The model is there and that's why you're saying you're starving the GPU. It's there but you can't feed the data that that model needs.
Kashish, 6:54 (Viral: 85.0)

We can just cache the transformed output. You don't need to do transformation every time, right? Just cache. The cache can always have NumPy.
Kashish, 19:56 (Viral: 80.0)

Speakers

Host: (not named)
Guest: Kashish

Topics Discussed
GPU Utilization Optimization: 95%
Data I/O Bottlenecks: 90%
PyArrow to NumPy Transformation Overhead: 85%
Reproducibility in Distributed Training: 80%
Model Serving Trade-offs: 75%
AI Agent Workflows: 70%
Warm-Starting Inference Models: 70%
Voice-Based Coding Sessions: 65%

People & Brands

Kashish (person): 12x, Positive
Uber (organization): 10x, Positive
Petastorm (product): 8x, Positive
PyArrow (product): 6x, Neutral
Google (organization): 5x, Neutral
NumPy (product): 5x, Positive
A100 (other): 4x, Neutral
Rob (person): 3x, Neutral
YouTube Ads (product): 3x, Positive
TPU (other): 3x, Neutral
