Scaling AI Infrastructure: Strategies for Resilient Fleet Operations

What does it take to run one of the world's largest AI supercomputers? As artificial intelligence workloads grow exponentially, operating a hyperscale AI cloud fleet demands new strategies for resilience, efficiency, and operational excellence. This session explores Microsoft’s approach to scaling infrastructure for 100X growth, focusing on the intersection of system innovation and advanced fleet management.

Session Topics:

Storage

Speaker(s):

Author:

Dharmesh Patel

Partner, Manufacturing Quality Engineering

Microsoft

Dharmesh Patel serves as the General Manager and head of the Quality Engineering Organization at Microsoft. In this capacity, he oversees the AI Fleet Quality team to ensure AI capacity, stability, and reliability throughout the hardware supply chain from manufacturing to data centers. His responsibilities include enabling Microsoft to scale AI capacity while maintaining high hardware quality standards across all stages of product development from concept through mass production. With nearly twenty years of experience in managing complex products and promoting process excellence within data centers, Dharmesh is a recognized leader in his field.

Author:

Prabhat Ram

Partner, Software Architect

Microsoft

Prabhat leads the AI Customer Experience team within Microsoft Azure. He is responsible for operating AI Training supercomputers for OpenAI and other strategic customers. He holds a master’s in Computer Science from Brown University and a PhD from the Earth and Planetary Sciences department at U.C. Berkeley.

He has coauthored more than 150 papers in computer and domain sciences, and his work has been recognized throughout the industry, including with the 2018 ACM Gordon Bell Prize for his team’s work on Exascale Deep Learning.

Time:

5:35 PM - 5:55 PM

Agenda Track No.:

Track 1

Session Type:

Track

Session Stage:

Hardware & Systems