
AI Computing: A Unified Hardware Standard for Mixed Accelerator Platforms 

Today, one need only turn on a TV to be inundated with commercials touting the benefits and potential of AI. Applications that could hardly be envisioned a short time ago are now commonplace, and the outlook is that AI will continue to grow by leaps and bounds. But achieving the promise of AI requires computing platforms that deliver high performance, robustness and scalability while embracing openness to enable interoperability and to respond more quickly and cost-effectively to market demands.

Helping to ensure interoperability and to aid manufacturers in meeting the demand for AI systems with enhanced capabilities, the Open Compute Project (OCP) engages numerous partners committed to advancing AI computing technology through open specifications; its latest project is referred to as Open Accelerator Infrastructure (OAI). Drawing on experience with previous open hardware and software projects, the organization has participants from throughout the computing ecosystem, and its most recent efforts have focused on delivering more streamlined and accessible open specifications for AI accelerator platforms.

A recent roundtable discussion with leaders from OCP and Baidu explored the development and value proposition of OAI and reached some interesting conclusions.

According to Archna Haylock, Community Director, Open Compute Foundation, “Companies today are facing numerous challenges, whether it comes to data center infrastructure, hardware acceleration, or hardware management from the facilities to the rack down to the nodes. What OCP brings to the table is an environment of collaboration to meet these challenges and find a common solution that works across the board and that provides economies of scale to achieve improved efficiencies and cost savings.”

A key objective for OAI was to simplify the design of the accelerator module. The resulting specification is a technical foundation on which manufacturers can design their own products without having to start from scratch. As with open source projects such as Hadoop, GFS and Linux, the specification can be downloaded freely, and users can pursue individual development efforts on top of it.

In effect, the specification promotes the convergence of different accelerator technologies, such as ASICs, GPUs and FPGAs, overcoming incompatibility issues and enabling them to operate under a unified hardware standard. In this way, users can swap among different chips freely, giving manufacturers more options and simplifying the supply side of the accelerator industry. The key technological advantages of OAI are (see the illustrative sketch after this list):

  • Comprehensive compatibility, supporting current AI accelerators such as FPGAs, GPUs and ASICs, as well as future generations of heterogeneous technologies.
  • Support for both 12V and 54V power delivery, with a maximum module power of 300W on the 12V rail and 450W-500W on the 54V rail.
  • Support for four interconnect topologies: hybrid cube mesh (HCM, for 8-port and 6-port OAM), fully connected (FC), combined FC/HCM and 4D hypercube.
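
To make the options above more concrete, the short Python sketch below models the power and topology choices as a simple data structure that a platform designer might use for basic sanity checks. It is purely illustrative: the class names, the power-ceiling table and the validation logic are assumptions made for this article, not part of the OAI specification itself.

    # Illustrative sketch only: these enums and limits paraphrase the OAM options
    # described in the list above and are not an official OCP/OAI artifact.
    from dataclasses import dataclass
    from enum import Enum


    class PowerRail(Enum):
        V12 = 12   # 12V rail, up to roughly 300W per module
        V54 = 54   # 54V rail, up to roughly 450W-500W per module


    class Topology(Enum):
        HCM = "hybrid cube mesh"      # for 8-port and 6-port OAM
        FC = "fully connected"
        FC_HCM = "combined FC/HCM"
        HYPERCUBE_4D = "4D hypercube"


    # Assumed per-rail power ceilings, taken from the figures listed above.
    MAX_POWER_W = {PowerRail.V12: 300, PowerRail.V54: 500}


    @dataclass
    class OAMConfig:
        """One possible way to describe a module configuration for planning."""
        rail: PowerRail
        module_power_w: int
        topology: Topology

        def validate(self) -> None:
            # Reject configurations that exceed the assumed ceiling for the rail.
            limit = MAX_POWER_W[self.rail]
            if self.module_power_w > limit:
                raise ValueError(
                    f"{self.module_power_w}W exceeds the ~{limit}W ceiling "
                    f"of the {self.rail.value}V rail"
                )


    if __name__ == "__main__":
        # A 54V module at 450W with a hybrid-cube-mesh fabric passes the check;
        # the same power budget on the 12V rail would raise an error.
        OAMConfig(PowerRail.V54, 450, Topology.HCM).validate()

In a real design flow these limits would come from the published OAM specification rather than a hard-coded table; the sketch simply shows how the specification narrows the configuration space a designer has to reason about.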

One of the first product offerings to benefit from the specification’s development is the Baidu X-MAN 4.0, a system developed jointly with Inspur. The evolution of the OAI specification started with the OCP Accelerator Module (OAM) specification, with contributions from Facebook, Microsoft and Baidu. From that point, it became clear that there was a need to expand the specification into an infrastructure in which the whole rack and system could operate with increased interoperability. Working under the framework of OCP, the OAI subgroup focused on how best to support diversified accelerators. As a result, manufacturers are given greater choice in an open ecosystem that will ultimately benefit developers and end users of AI applications.

Richard Ding, AI System Architect from Baidu, also commented: “OCP is a very good platform for the people and users and system integrators, as well as chip providers to perform on one stage. For Baidu, OCP was the platform where we could better identify our requirements, discover how we could work together with our partners, even sometimes our competitors, and define a kind of standard that can benefit the entire ecosystem.”

The scope of the OAI subgroup’s work included defining both the physical and logical aspects of the modules, such as electrical, mechanical, thermal, management, hardware security and physical serviceability, to produce solutions compatible with existing operating systems and to allow the creation of frameworks for running heterogeneous accelerator applications. Moving forward, there is growing industry consensus that encouraging the specification’s adoption, along with further practical application testing, will enable ongoing advancements in the AI ecosystem through standardization.

Conclusion:

The OAI project is built around a modular architecture that can support different accelerators and multi-system scale-up, making interconnect and communication between systems straightforward. The task ahead is to promote its application and garner increased support from the industry to achieve greater scale, both across the high-performance computing ecosystem and in vertical markets. As the standard gains practical significance, real-world deployments will expose the strengths and weaknesses of the specifications, allowing the standard’s technology to be upgraded to meet real-world AI computing scenarios.

Alan Chang is deputy general manager of the server product line at Inspur.
