Covering Scientific & Technical AI | Sunday, November 24, 2024

New to OCI AI Infrastructure: Midrange Bare Metal Compute with NVIDIA L40S and VMs with NVIDIA H100/A100 

Aug. 2, 2024 -- Oracle Cloud Infrastructure (OCI) recently announced that NVIDIA L40S GPU bare metal instances are available to order, along with the upcoming availability of a new virtual machine accelerated by a single NVIDIA H100 Tensor Core GPU. In the blog post below, Oracle's Akshai Parthasarathy and Sagar Zanwar discuss how the new offerings aim to bridge the gap for enterprise use cases in medium-scale AI training and inference, and detail the technical capabilities and advantages of the new OCI shapes.


Oracle has a history of building cutting-edge infrastructure. We pioneered RDMA-based cluster networking with Exadata v1 more than 15 years ago. We were the first to introduce bare metal compute shapes among hyperscalers in 2016.

We underpin the largest AI clusters with OCI Supercluster, which scales to tens of thousands of NVIDIA A100 and H100 Tensor Core GPUs. We also enable small virtual desktop, AI inference, and AI training workloads on a single node with one to four NVIDIA A10 Tensor Core GPUs. Between these very small and very large deployments, however, there was a gap to fill for enterprise use cases in medium-scale AI training and inference, as shown below.

Credit: Oracle

The two new shapes we’re announcing are:

  • BM.GPU.L40S.4 ("L40S Bare Metal"): supports up to 3,840 GPUs in an OCI Supercluster, with 1,466 TFLOPS per NVIDIA L40S GPU.
  • VM.GPU.A100.1 and VM.GPU.H100.1 ("A100 VM" and "H100 VM," respectively): support a single GPU in a VM form factor, with up to 3,958 TFLOPS per NVIDIA H100 GPU.

OCI Bare Metal Compute with Four NVIDIA L40S GPUs

Orderable today, the BM.GPU.L40S.4 bare metal compute shape features four NVIDIA L40S GPUs, each with 48GB of GDDR6 memory. The shape also includes 7.38TB of local NVMe drive capacity, 4th Generation Intel Xeon CPUs with 112 cores, and 1TB of system memory. With this addition, OCI offers the widest selection of bare metal shapes among public cloud hyperscalers. These shapes eliminate virtualization overhead for high-throughput, latency-sensitive AI/ML workloads. The shape also features NVIDIA BlueField-3 DPUs for improved server efficiency, offloading data center tasks from CPUs to accelerate networking, storage, and security workloads. The use of BlueField-3 DPUs supports OCI's strategy of off-box virtualization across its entire fleet.

OCI Supercluster's ultralow-latency networking pairs with NVIDIA L40S GPUs for training and inference of LLMs at midrange scale. OCI's cluster network uses RDMA over Converged Ethernet version 2 (RoCE v2) on NVIDIA ConnectX RDMA NICs to support high-throughput, latency-sensitive workloads. The BM.GPU.L40S.4 instance can also be used as a standalone virtual workstation with four NVIDIA L40S GPUs. These midrange clusters are supported with 800 Gb/sec of internode bandwidth, as shown below.

Comparison of OCI shapes for NVIDIA A10, NVIDIA L40S, and NVIDIA H100 GPUs

                             BM.GPU.A10.4   BM.GPU.L40S.4 (new)   BM.GPU.H100.8
Form factor                  Bare metal     Bare metal            Bare metal
Hourly price ($ per GPU)     $2             $3.50                 $10
Performance (TFLOPS)*        250 (1x)       1,466 (5x)            3,958 (15x)
Scalability on OCI (# GPUs)  4 (per node)   3,840 (per cluster)   16,384 (per cluster)
Cluster network bandwidth    N/A            800 Gbps              3,200 Gbps

* FP16 for NVIDIA A10 and FP8 for NVIDIA L40S and NVIDIA H100
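As a back-of-envelope illustration of what the 800 Gbps internode figure above means for distributed training, the sketch below estimates how long an idealized full gradient exchange would take. The 7B-parameter model size and FP16 gradients are illustrative assumptions, not figures from the announcement, and real all-reduce traffic patterns add overhead on top of this lower bound.

```python
# Idealized gradient-sync estimate over the 800 Gb/s internode fabric
# quoted for L40S clusters. Model size and precision are assumptions.

PARAMS = 7e9          # hypothetical model size (parameters)
BYTES_PER_GRAD = 2    # FP16 gradients
LINK_GBPS = 800       # internode bandwidth per node, in gigabits/s

payload_gb = PARAMS * BYTES_PER_GRAD / 1e9  # gradient payload in GB
link_gbytes = LINK_GBPS / 8                 # bandwidth in GB/s
transfer_s = payload_gb / link_gbytes       # idealized transfer time

print(f"{payload_gb:.0f} GB of gradients, ~{transfer_s * 1e3:.0f} ms per sync")
```

Under these assumptions, moving 14 GB of gradients over a 100 GB/s link takes on the order of 140 ms per step, which is why the midrange shapes pair GPU count with proportionally scaled cluster bandwidth.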

OCI engineers helped Beamr, a world leader in content-adaptive video solutions that specializes in transforming videos into smaller, faster, lower-cost versions without compromising quality, launch BeamrCloud in just four months. The company elaborates on its use of OCI:

“We chose OCI AI infrastructure with bare metal instances and NVIDIA L40S Tensor Core GPUs for 30% more efficient video encoding. Videos processed with BeamrCloud on OCI will have up to 50% reduced storage and network bandwidth consumption, speeding up file transfers by 2x and increasing productivity for end-users. Beamr will provide OCI customers video AI workflows, preparing them for the future of video,” said Sharon Carmel, CEO of Beamr Cloud.

OCI Compute VMs with One NVIDIA H100 GPU and One NVIDIA A100 GPU

We will soon offer compute virtual machine shapes featuring a single NVIDIA H100 GPU with 80GB of HBM3 memory or a single NVIDIA A100 GPU with 40GB or 80GB of HBM2e memory. The VM.GPU.H100.1 shape also includes 2x 3.84TB of NVMe drive capacity, 13 cores of 4th Gen Intel Xeon processors, and 246GB of system memory, making it well-suited for a range of AI tasks.

This new offering provides an effective platform for smaller workloads and LLM inference, and with the NVIDIA H100 GPU's Transformer Engine and FP8 support, it can allow large models to be quantized and run efficiently on a single GPU.
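A rough footprint check makes the single-GPU quantization point concrete. The sketch below compares FP16 and FP8 weight memory for a hypothetical 70B-parameter model against the 80GB of HBM3 on the H100 VM; the model size is an illustrative assumption, and the calculation deliberately ignores KV cache and activation memory.

```python
# Rough memory-footprint check for the claim that FP8 quantization lets
# larger models fit on a single 80 GB H100. Model size is illustrative.

HBM_GB = 80  # H100 HBM3 capacity per the announcement

def weight_footprint_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Approximate weight memory in GB (ignores KV cache and activations)."""
    return params_billion * 1e9 * bytes_per_weight / 1e9

# A hypothetical 70B-parameter model:
fp16 = weight_footprint_gb(70, 2)   # 140 GB -- does not fit in 80 GB
fp8  = weight_footprint_gb(70, 1)   # 70 GB  -- fits, leaving headroom

print(f"FP16: {fp16:.0f} GB, FP8: {fp8:.0f} GB, HBM: {HBM_GB} GB")
```

In practice the remaining headroom must also hold the KV cache and activations, so the serviceable model size depends on batch size and context length, but halving bytes per weight is what moves a model of this class from multi-GPU to single-GPU territory.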

Like other NVIDIA-accelerated shapes on OCI, this shape is compatible with OCI Kubernetes Engine (OKE) and the NVIDIA GPU Operator for Kubernetes. NVIDIA Inference Microservices (NIM), part of the NVIDIA AI Enterprise software platform, and other containers from the NVIDIA NGC catalog can be seamlessly deployed on OKE.

Altair is transforming enterprise decision making by leveraging the convergence of simulation, high-performance computing, and artificial intelligence. OCI has helped the company design applications for use with high-speed GPUs. Its use of OCI is described below:

“Oracle Cloud’s bare metal compute with NVIDIA H100 and A100 Tensor Core GPUs, low-latency OCI Supercluster, and high-performance storage delivers up to 20% better price-performance for Altair’s computational fluid dynamics (CFD) and structural mechanics solvers. We look forward to leveraging these GPUs with virtual machines for the Altair Unlimited virtual appliance,” said Yeshwant Mummaneni, Chief Engineer for Data Management and Analytics at Altair.

OCI Compute with NVIDIA GH200 Superchip

Finally, the BM.GPU.GH200 compute shape is available for customer testing. It features the NVIDIA Grace Hopper Superchip and NVIDIA NVLink-C2C, a high-bandwidth, cache-coherent 900 GB/s connection between the NVIDIA Grace CPU and Hopper GPU that provides over 600GB of accessible memory, enabling up to 10x higher performance for AI and HPC workloads. Customers interested in the NVIDIA Grace architecture and the upcoming NVIDIA Grace Blackwell Superchip can reach out to OCI to get access.

Partnership

All of these shapes can be combined with NVIDIA AI Enterprise, a software platform whose microservices accelerate data science pipelines and streamline the development and deployment of generative AI. Oracle offers several services to deploy and manage containers, including OCI Kubernetes Engine, which is compatible with NVIDIA operators.

Summary and Next Steps

Oracle was the first major cloud provider to offer bare metal instances and will soon scale to 65,536 NVIDIA GPUs in OCI Supercluster, while improving manageability for both OCI cloud operators and customers. Together, bare metal instances, off-box virtualization, and scalability make OCI AI infrastructure a compelling choice for AI/ML workloads.

Visit OCI AI Infrastructure online and reach out to sales to learn more.


Source: Akshai Parthasarathy and Sagar Zanwar

AIwire