Scaling AI Infrastructure: Strategies for Resilient Fleet Operations | Kisaco Research

What does it take to run one of the world's largest AI supercomputers? As artificial intelligence workloads grow exponentially, operating a hyperscale AI cloud fleet demands new strategies for resilience, efficiency, and operational excellence. This session explores Microsoft’s approach to scaling infrastructure for 100X growth, focusing on the intersection of system innovation and advanced fleet management.

Session Topics: 
Storage
Speaker(s): 

Author:

Dharmesh Patel

Partner, Manufacturing Quality Engineering
Microsoft

Dharmesh Patel serves as the General Manager and head of the Quality Engineering Organization at Microsoft. In this capacity, he oversees the AI Fleet Quality team to ensure AI capacity, stability, and reliability throughout the hardware supply chain from manufacturing to data centers. His responsibilities include enabling Microsoft to scale AI capacity while maintaining high hardware quality standards across all stages of product development from concept through mass production. With nearly twenty years of experience in managing complex products and promoting process excellence within data centers, Dharmesh is a recognized leader in his field.

Author:

Prabhat Ram

Partner, Software Architect
Microsoft

Prabhat leads the AI Customer Experience team within Microsoft Azure. He is responsible for operating AI Training supercomputers for OpenAI and other strategic customers. He holds a master’s in Computer Science from Brown University and a PhD from the Earth and Planetary Sciences department at U.C. Berkeley.

He has coauthored more than 150 papers on computer and domain sciences, and his work has been recognized throughout the industry, including with the 2018 ACM Gordon Bell Prize for his team’s work on Exascale Deep Learning.

Time: 
5:35 PM - 5:55 PM
Agenda Track No.: 
Track 1
Session Type: 
Track
Session Stage: