source: hugging face blog: building blocks for foundation model training and inference on aws

level: technical

scaling foundation models now spans pre-training, post-training, and test-time compute, each demanding tightly coupled accelerators, low-latency networks, and fast shared storage. aws provides multi-gpu instances from the p5 and p6 families, with nvidia h100, h200, b200, and b300 gpus offering up to 2.25 petaflops of dense bf16 tensor throughput and up to 288 gb of hbm3e memory per gpu. within a node, nvlink and nvswitch deliver up to 14.4 tb/s aggregate bandwidth for gpu-to-gpu communication. across nodes, elastic fabric adapter (efa) provides os-bypass rdma networking with up to 800 gb/s aggregate bandwidth on p6 instances, enabling efficient collective operations at scale.
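
as a rough illustration (not from the source), a minimal multi-node pytorch init over nccl/efa might look like the sketch below. the efa-related environment variables are illustrative defaults normally handled by the aws-ofi-nccl plugin or the deep learning ami, and torchrun is assumed to supply RANK / WORLD_SIZE / LOCAL_RANK:

```python
# minimal sketch: bring up nccl over efa and verify a cross-node collective.
# the efa env vars are assumed defaults, not prescriptive settings.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("FI_PROVIDER", "efa")           # route libfabric traffic over efa
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # gpu-direct rdma where supported
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport nccl picks

def main():
    dist.init_process_group(backend="nccl")           # rendezvous via torchrun's env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # small all-reduce to confirm collectives work before launching real training
    x = torch.ones(1024, device="cuda")
    dist.all_reduce(x)
    if dist.get_rank() == 0:
        print(f"all-reduce ok across {dist.get_world_size()} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

this would be launched per node with something like `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 check_nccl.py` (script name and node count are placeholders).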

storage is tiered: local nvme ssds handle hot data, amazon fsx for lustre offers a shared high-throughput parallel file system, and amazon s3 provides durable persistence with lazy loading and automatic checkpoint export. for extreme scale, ec2 ultraclusters provision thousands of instances on a petabit-scale nonblocking network. ultraservers extend the nvlink domain beyond a single instance: a p6e-gb200 nvl72 configuration exposes up to 72 blackwell gpus and 13.4 tb of hbm3e in a single coherent domain, reducing cross-node communication for large models.
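
to make the tiering concrete, here is a minimal checkpointing sketch; the fsx mount point, bucket name, and inline upload are assumptions, and real setups often delegate the durable copy to fsx's own s3 export or an async uploader:

```python
# minimal sketch of tiered checkpointing: write to a fast shared fsx path,
# then push a copy to s3 for durable persistence. paths and bucket name
# ("my-training-ckpts") are placeholders; error handling is omitted.
import os
import boto3
import torch

FSX_DIR = "/fsx/checkpoints"      # assumed fsx for lustre mount point
S3_BUCKET = "my-training-ckpts"   # hypothetical bucket

def save_checkpoint(step: int, model: torch.nn.Module) -> None:
    os.makedirs(FSX_DIR, exist_ok=True)
    path = os.path.join(FSX_DIR, f"step_{step:08d}.pt")
    torch.save(model.state_dict(), path)   # hot copy on the shared parallel fs

    # durable copy in s3; in practice this often runs asynchronously
    s3 = boto3.client("s3")
    s3.upload_file(path, S3_BUCKET, f"checkpoints/{os.path.basename(path)}")
```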

resource orchestration relies on slurm or kubernetes to schedule jobs across hundreds of accelerators. slurm uses a modular plugin architecture for topology-aware placement, gpu scheduling via generic resources (gres), and backfill scheduling to maximize utilization. kubernetes offers similar capabilities through custom schedulers and operators. observability is layered across the stack, with prometheus collecting metrics and grafana visualizing them, helping diagnose performance issues in distributed training and inference workflows.
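
a minimal, assumed sketch of the metrics side: each worker exposes a /metrics endpoint via prometheus_client (metric names and the port are arbitrary placeholders), which prometheus scrapes and grafana plots:

```python
# minimal sketch: expose per-worker training metrics for prometheus to scrape.
# metric names and port 8000 are illustrative choices, not from the source.
import time
from prometheus_client import start_http_server, Counter, Gauge

TOKENS = Counter("train_tokens_total", "tokens processed by this worker")
STEP_TIME = Gauge("train_step_seconds", "wall-clock time of the last step")

def training_loop(steps: int, tokens_per_step: int) -> None:
    start_http_server(8000)                # serves /metrics on this worker
    for _ in range(steps):
        t0 = time.time()
        # ... forward / backward / optimizer step would go here ...
        time.sleep(0.01)                   # stand-in for real work
        TOKENS.inc(tokens_per_step)
        STEP_TIME.set(time.time() - t0)

if __name__ == "__main__":
    training_loop(steps=100, tokens_per_step=65536)
```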

why it matters: understanding these building blocks helps ml engineers select and configure aws infrastructure for efficient large-scale training and inference, reducing bottlenecks in communication and storage.
