The need for scalable, high-performance infrastructure continues to grow as the AI landscape advances. Our customers rely on Azure AI infrastructure to develop innovative AI-driven solutions, which is why today we are delivering new cloud-based AI supercomputing clusters built with Azure ND H200 v5 series virtual machines (VMs). These VMs are now generally available and are tailored to handle the growing complexity of advanced AI workloads, from foundational model training to generative inferencing. The scale, efficiency, and enhanced performance of our ND H200 v5 VMs are already driving adoption from customers and Microsoft AI services such as Azure Machine Learning and Azure OpenAI Service.
“We’re excited to adopt Azure’s new H200 VMs. We’ve seen that H200 offers improved performance with minimal porting effort, and we are looking forward to using these VMs to accelerate our research, improve the ChatGPT experience, and further our mission.” —Trevor Cai, head of infrastructure, OpenAI.
The Azure ND H200 v5 VMs are architected with Microsoft’s systems approach to enhance efficiency and performance, and feature eight NVIDIA H200 Tensor Core GPUs. Specifically, they address the gap created as GPUs have grown in raw computational capability much faster than their attached memory and memory bandwidth. The Azure ND H200 v5 series VMs deliver a 76% increase in High Bandwidth Memory (HBM), to 141 GB per GPU, and a 43% increase in HBM bandwidth, to 4.8 TB/s, over the previous generation of Azure ND H100 v5 VMs. This increase in HBM bandwidth enables GPUs to access model parameters faster, helping reduce overall application latency, a critical metric for real-time applications such as interactive agents. The ND H200 v5 VMs can also accommodate more complex large language models (LLMs) within the memory of a single VM, improving performance by helping users avoid the overhead of running distributed jobs over multiple VMs.
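To make the single-VM sizing point concrete, here is a back-of-envelope sketch. The parameter counts and precisions below are illustrative assumptions, not official sizing guidance, and the calculation ignores KV cache, activations, and framework overhead:

```python
# Back-of-envelope check: do a model's raw weights fit in the aggregate
# HBM of one VM? (Illustrative only; real deployments also need memory
# for KV cache, activations, and framework overhead.)

GPUS_PER_VM = 8
HBM_PER_GPU_GB = 141  # ND H200 v5 (vs. 80 GB per GPU on ND H100 v5)

def weights_fit(params_billions: float, bytes_per_param: int) -> bool:
    """Return True if the raw weights fit in a single VM's HBM."""
    weights_gb = params_billions * bytes_per_param  # 1B params * 1 B = 1 GB
    return weights_gb <= GPUS_PER_VM * HBM_PER_GPU_GB

# A hypothetical 405B-parameter model:
print(weights_fit(405, 2))  # FP16/BF16 weights: 810 GB vs. 1128 GB of HBM
print(weights_fit(405, 4))  # FP32 weights: 1620 GB, would need multiple VMs
```

Under these assumptions, a 405B-parameter model in 16-bit precision fits within one ND H200 v5 VM’s 1,128 GB of aggregate HBM, while the same model in 32-bit precision would not.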
The design of our H200 supercomputing clusters also enables more efficient management of GPU memory for model weights, key-value cache, and batch sizes, all of which directly impact throughput, latency, and cost-efficiency in LLM-based generative AI inference workloads. With its larger HBM capacity, the ND H200 v5 VM can support higher batch sizes, driving better GPU utilization and throughput compared to the ND H100 v5 series for inference workloads on both small language models (SLMs) and LLMs. In early tests, we observed up to a 35% throughput increase with ND H200 v5 VMs compared to the ND H100 v5 series for inference workloads running the Llama 3.1 405B model (with world size 8, input length 128, output length 8, and maximum batch sizes of 32 for H100 and 96 for H200). For more details on Azure’s high-performance computing benchmarks, please read more here or visit our AI Benchmarking Guide on the Azure GitHub repository.
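The batch-size relationship can be sketched with a simplified model: after the weights are loaded, the HBM headroom left for the KV cache caps the feasible batch size. All model dimensions below are assumptions chosen for illustration; this is not the benchmark methodology above, and the formula ignores activations and framework overhead:

```python
# Simplified sketch: HBM headroom after weights bounds the batch size.
# Model dimensions are illustrative assumptions, not a real config.

def max_batch_size(hbm_total_gb: float, weights_gb: float,
                   layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # KV cache per sequence: 2 tensors (K and V) per layer, each
    # kv_heads * head_dim wide, one entry per token of context.
    kv_per_seq_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
    free_bytes = (hbm_total_gb - weights_gb) * 1e9
    return max(int(free_bytes // kv_per_seq_bytes), 0)

# Hypothetical 405B-class model sharded across all 8 GPUs
# (FP8 weights ~405 GB; 126 layers, 8 KV heads, head_dim 128,
# FP16 KV cache, 4k-token context):
for label, hbm_gb in [("8x H100 (640 GB)", 640), ("8x H200 (1128 GB)", 1128)]:
    print(label, max_batch_size(hbm_gb, weights_gb=405, layers=126,
                                kv_heads=8, head_dim=128, seq_len=4096))
```

Under these assumed dimensions the H200 configuration supports roughly three times the batch size of the H100 configuration, which is the mechanism behind the utilization and throughput gains described above.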
The ND H200 v5 VMs come pre-integrated with Azure Batch, Azure Kubernetes Service, Azure OpenAI Service, and Azure Machine Learning to help businesses get started right away. Please visit here for detailed technical documentation on the new Azure ND H200 v5 VMs.
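For teams provisioning VMs directly, a deployment can be sketched with the Azure CLI. The VM size SKU and image URN below are assumptions based on Azure’s ND-series naming conventions; confirm both against the ND H200 v5 documentation before use:

```shell
# Hypothetical sketch: creating a single ND H200 v5 VM with the Azure CLI.
# The --size SKU and --image URN are assumed values; verify them against
# the ND H200 v5 documentation and your region's availability.
az group create --name my-hpc-rg --location eastus

az vm create \
  --resource-group my-hpc-rg \
  --name nd-h200-node \
  --size Standard_ND96isr_H200_v5 \
  --image microsoft-dsvm:ubuntu-hpc:2204:latest \
  --generate-ssh-keys
```

For production AI workloads, the managed paths named above (Azure Machine Learning, Azure Kubernetes Service, or Azure Batch) handle cluster orchestration, scheduling, and InfiniBand configuration on your behalf.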
By: Nitin Nagarkatte, Principal Product Manager, Azure HPC+AI
Originally published at: Microsoft Azure Blog