Declarative GPU Cluster Orchestration for Fault-Tolerant Distributed Training on Google Cloud | Kisaco Research

Distributed training jobs are brittle; a single node failure can halt progress and waste expensive GPU cycles. This technical demo dives into Cluster Director, focusing on how engineers can automate resilient, large-scale GPU infrastructure. We'll start with a declarative YAML configuration to define and provision a multi-node GPU cluster, optimized with the ideal network topology for NCCL communication. The core of the demo will be a live failure simulation. You will see Cluster Director automatically detect a preempted node, perform remediation, and maintain the integrity of the running workload with minimal disruption.
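To give a flavor of the declarative approach the demo walks through, a cluster spec along these lines could capture the same intent. This is an illustrative sketch only: the field names and structure are assumptions for this abstract, not Cluster Director's actual schema.

```yaml
# Hypothetical cluster spec -- field names are illustrative,
# not Cluster Director's real configuration format.
cluster:
  name: training-cluster
  region: us-central1
  nodes:
    count: 16
    machineType: a3-highgpu-8g   # GCP A3 VM with 8x H100 GPUs
    gpusPerNode: 8
  network:
    placement: compact           # co-locate nodes to keep NCCL hop count low
  resilience:
    healthChecks: enabled
    onNodeFailure: replace       # auto-remediate preempted or failed nodes
```

The key idea is that topology and failure-handling policy live in the spec itself, so the orchestrator, rather than the operator, is responsible for detection and remediation.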

Sponsor(s): Google
Session Type: General Session (Presentation)