In this session, we will explore the end-to-end workflow of managing foundation model (FM) development on Amazon SageMaker HyperPod. Our discussion will cover both distributed model training and inference using frameworks like PyTorch and KubeRay. Additionally, we will dive into operational aspects, including system observability and resiliency features for scale and cost-performance using Amazon EKS on SageMaker HyperPod. By the end of this hands-on session, you will gain a robust understanding of training and deploying FMs efficiently on AWS. You will learn to leverage cutting-edge techniques and tools to ensure high performance, reliable, and scalable FM development.
Location: Room 206
Duration: 1 hour

Mark Vinciguerra
Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on Generative AI training and inference, with the goal of helping customers architect, optimize, and scale their workloads across various AWS services. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.

Aravind Neelakantan
Aravind Neelakantan is a Specialist Solutions Architect at Amazon Web Services (AWS), focusing on Generative AI training and inference solutions. His expertise spans artificial intelligence and high-performance computing, with research interests in performance analysis and optimization of AI workloads on heterogeneous architectures. Prior to AWS, Aravind developed reference architectures for large-scale AI clusters and specialized in AI inference performance optimization using hardware accelerators. He holds a Ph.D. from the University of Florida, specializing in heterogeneous HPC architecture.

Aman Shanbhag
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
AWS
Website: https://aws.amazon.com/
Since 2006, Amazon Web Services has been the world’s most comprehensive and broadly adopted cloud. AWS has been continually expanding its services to support virtually any workload, and it now has more than 240 fully featured services for compute, storage, databases, networking, analytics, machine learning and artificial intelligence (AI), Internet of Things (IoT), mobile, security, hybrid, media, and application development, deployment, and management from 114 Availability Zones within 36 geographic regions, with announced plans for 12 more Availability Zones and four more AWS Regions in New Zealand, the Kingdom of Saudi Arabia, Taiwan, and the AWS European Sovereign Cloud. Millions of customers—including the fastest-growing startups, largest enterprises, and leading government agencies—trust AWS to power their infrastructure, become more agile, and lower costs. To learn more about AWS, visit aws.amazon.com.






















