Declarative GPU Cluster Orchestration for Fault-Tolerant Distributed Training on Google Cloud | Kisaco Research

Distributed training jobs are brittle; a single node failure can halt progress and waste expensive GPU cycles. This technical demo dives into Cluster Director, focusing on how engineers can automate resilient, large-scale GPU infrastructure. We'll start with a declarative YAML configuration to define and provision a multi-node GPU cluster, optimized with the ideal network topology for NCCL communication. The core of the demo will be a live failure simulation. You will see Cluster Director automatically detect a preempted node, perform remediation, and maintain the integrity of the running workload with minimal disruption.
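To give a flavor of the declarative approach the demo walks through, a cluster spec along these lines could capture the same intent. This is an illustrative sketch only: the field names and structure are assumptions for this abstract, not Cluster Director's actual schema.

```yaml
# Hypothetical cluster spec -- field names are illustrative,
# not Cluster Director's real configuration format.
cluster:
  name: training-cluster
  region: us-central1
  nodes:
    count: 16
    machineType: a3-highgpu-8g   # GCP A3 VM with 8x H100 GPUs
    gpusPerNode: 8
  network:
    placement: compact           # co-locate nodes to keep NCCL hop count low
  resilience:
    healthChecks: enabled
    onNodeFailure: replace       # auto-remediate preempted or failed nodes
```

The key idea is that topology and failure-handling policy live in the spec itself, so the orchestrator, rather than the operator, is responsible for detection and remediation.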

Sponsor(s): Google
Session Type: General Session (Presentation)