Designing Transportable, High-Performance AI Systems for the Rugged Edge
System design requirements are well understood for high-performance artificial intelligence applications destined to reside in enterprise or cloud data centers. Data centers are specifically designed to provide a clean, cool environment with stable, standard power and no need to worry about vibration or shock loads.
Putting the most sophisticated AI computing capability in the field in military and industrial applications is a whole different story. Today, demand is growing to place this capability outside the data center, directly at the edge. Being able to fully exploit AI benefits in real time, near the data and the action, is highly valuable. Many of these applications must reside in mobile platforms that not only carry out their mission but must also operate autonomously or semi-autonomously. These applications are driving demand for data center class AI performance in systems that must now operate in far more challenging environments.
Whether deployed in the tactical arena on terrestrial vehicles, aircraft, or ships, or in industrial settings on autonomous mining, construction, or farming equipment, AI systems must now withstand uniquely harsh operating conditions. Three specific areas of environmental divergence from data centers illustrate the challenges system designers face: vibration, temperature, and power.
Deploying AI-enabled intelligence, surveillance, and reconnaissance (ISR) systems on US Navy P-8 aircraft requires designs that mitigate the vibration profile of that aircraft. That profile differs from the one for a propeller-driven P-3, an AI remote sensor aggregation platform on a drone, or an AI threat detection system on a helicopter. These airborne-specific vibration profiles also differ in frequency and amplitude from the profiles for mobile radar stations, autonomous tanks, or mobile data centers which, in turn, differ from maritime requirements for ISR, threat detection, and autonomous operation. Designing effectively for these environments is critical for reliable operation.
Unconstrained vibration causes wear and potential failure in electronic connections, add-in cards, memory DIMMs, heat sinks, and power supply components. Appropriate design techniques include finely tuned finite element simulations that identify problematic harmonics in the chassis design, which are then mitigated through combinations of strategic strengthening, tie-downs, constraints, and bonding.
This chassis ruggedization must be done efficiently so that it does not compromise the overall weight and size requirements of edge applications. For military applications, this means designing and testing to the appropriate MIL-STD-810G profiles and selecting appropriate materials and manufacturing techniques. Designs using milled aluminum frames can provide localized rigidity where required to avoid key resonance modes while keeping size and weight to a minimum.
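The link between stiffness and resonance can be illustrated with an idealized single-degree-of-freedom spring-mass model, in which the undamped natural frequency is f = (1/2π)·sqrt(k/m). The sketch below is not a substitute for finite element analysis of a real chassis. The function name and the stiffness and mass values are hypothetical, chosen only to show how adding stiffness without adding mass moves a resonance to a higher frequency.

```python
import math


def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
    """Undamped natural frequency of an idealized spring-mass system, in Hz."""
    if stiffness_n_per_m <= 0 or mass_kg <= 0:
        raise ValueError("stiffness and mass must be positive")
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2 * math.pi)


# Hypothetical values for a component and its mounting (assumptions, not measured data).
baseline_hz = natural_frequency_hz(stiffness_n_per_m=1.0e6, mass_kg=1.0)

# Same mass, four times the mounting stiffness (for example, after adding a restraint).
stiffened_hz = natural_frequency_hz(stiffness_n_per_m=4.0e6, mass_kg=1.0)

print(f"baseline:  {baseline_hz:.1f} Hz")
print(f"stiffened: {stiffened_hz:.1f} Hz")
```

In this simplified model, quadrupling the stiffness doubles the natural frequency. That is the basic reason localized rigidity can shift a mode away from the frequencies a given platform excites most strongly.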
Outside the data center, unstable temperature conditions also present difficult design problems for anyone trying to use the latest components to deliver uncompromising AI capability. The challenge is compounded by the fact that the high-performance components that deliver the most value to AI generate a huge amount of heat themselves because of the power they require. The latest AI-focused GPUs consume up to 500W of power each, with next-generation GPUs projected to increase power consumption to 700W. Other key components in these systems, including CPUs, memory, and switch chips, also generate significant heat. For long-term reliable operation, these components must be kept within their respective operating temperature thresholds.
Ambient temperatures in tactical theater environments can vary from extreme cold to extreme heat. Additionally, changes in altitude and humidity can impact the efficiency of cooling strategies. Different applications will also reside in different cooling infrastructures, with air, liquid, or plenum cooling as options. Plug-and-play air-cooled systems rely on the orchestration of fans, baffling, heat sinks, and enclosure design to ensure the high-performance components can operate efficiently at maximum performance.
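One way to see why altitude matters for air cooling is to estimate how much air must move through a chassis to carry away a given heat load. The sketch below uses the International Standard Atmosphere model for air density and the energy balance Q = ṁ·cp·ΔT. The 2000 W load corresponds to four GPUs at 500W each. The 15 K allowable air temperature rise, the constant specific heat, and the function names are assumptions for illustration, and real ambient conditions will deviate from the standard atmosphere.

```python
# International Standard Atmosphere constants (troposphere only).
SEA_LEVEL_PRESSURE_PA = 101_325.0
SEA_LEVEL_TEMP_K = 288.15
LAPSE_RATE_K_PER_M = 0.0065
GAS_CONSTANT_AIR = 287.05   # J/(kg*K)
PRESSURE_EXPONENT = 5.25588  # g / (R * lapse rate) for the standard atmosphere
CP_AIR = 1005.0              # J/(kg*K), treated as constant (assumption)


def air_density_kg_per_m3(altitude_m: float) -> float:
    """Standard-atmosphere air density at a given altitude."""
    temp_k = SEA_LEVEL_TEMP_K - LAPSE_RATE_K_PER_M * altitude_m
    pressure_pa = SEA_LEVEL_PRESSURE_PA * (temp_k / SEA_LEVEL_TEMP_K) ** PRESSURE_EXPONENT
    return pressure_pa / (GAS_CONSTANT_AIR * temp_k)


def required_airflow_m3_per_s(heat_w: float, air_temp_rise_k: float, altitude_m: float) -> float:
    """Volumetric airflow needed to remove heat_w with the given air temperature rise."""
    mass_flow_kg_per_s = heat_w / (CP_AIR * air_temp_rise_k)
    return mass_flow_kg_per_s / air_density_kg_per_m3(altitude_m)


HEAT_LOAD_W = 2000.0     # four GPUs at 500 W each
AIR_TEMP_RISE_K = 15.0   # hypothetical allowable inlet-to-outlet rise
ALTITUDE_M = 3048.0      # 10,000 ft

flow_sea_level = required_airflow_m3_per_s(HEAT_LOAD_W, AIR_TEMP_RISE_K, 0.0)
flow_altitude = required_airflow_m3_per_s(HEAT_LOAD_W, AIR_TEMP_RISE_K, ALTITUDE_M)

print(f"sea level: {flow_sea_level:.3f} m^3/s")
print(f"10,000 ft: {flow_altitude:.3f} m^3/s")
print(f"increase:  {100 * (flow_altitude / flow_sea_level - 1):.0f}%")
```

Under these assumptions, the thinner air at 10,000 ft requires roughly a third more volumetric airflow for the same heat load, which is why fan and heat sink designs validated at sea level may not hold at altitude.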
For edge applications with in-person operation, the cooling design must also accommodate human factors, including noise restrictions. Although it is undesirable, system integrators commonly compromise system performance to meet the cooling requirements. Many systems deployed in edge computing applications today avoid the latest high-performance components because their designs cannot solve the cooling issues.
A third area where system design for high-performance edge AI differs significantly from data center systems is input power. Data centers often pride themselves on providing reliable and stable power, typically at 110-220 VAC. Although edge systems can support this standard input, many vehicles that carry transportable AI systems have power sources that vary greatly in voltage and frequency. Tactical terrestrial vehicles, including mobile data centers, will often provide DC power at voltages ranging from 270V to 48V to 24V. In some autonomous truck applications, 12 VDC is desirable.
Airborne systems will often need to support three-phase 400-800 Hz AC power. Compounding the design complexity are the overall power requirements of these high-end AI platforms. With four or even eight of the latest AI GPUs and other high-end components, systems can require up to 3000W of redundant power. These levels far exceed what has traditionally been deployed in edge applications.
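A quick calculation shows why the input voltage matters so much at these power levels. The sketch below applies I = P / (V · η) to a 3000W load at the DC voltages mentioned above. The function name and the 90 percent conversion efficiency are assumptions for illustration, not figures for any particular power supply.

```python
def input_current_a(load_w: float, input_v: float, efficiency: float = 1.0) -> float:
    """Current drawn from the source for a given load, input voltage, and efficiency."""
    if input_v <= 0 or not 0 < efficiency <= 1:
        raise ValueError("input voltage must be positive and efficiency in (0, 1]")
    return load_w / (input_v * efficiency)


SYSTEM_LOAD_W = 3000.0                  # upper power figure cited for these systems
DC_INPUTS_V = [270.0, 48.0, 24.0, 12.0]  # DC vehicle supplies mentioned above
ASSUMED_EFFICIENCY = 0.90               # hypothetical conversion efficiency

for volts in DC_INPUTS_V:
    ideal_a = input_current_a(SYSTEM_LOAD_W, volts)
    derated_a = input_current_a(SYSTEM_LOAD_W, volts, ASSUMED_EFFICIENCY)
    print(f"{volts:5.0f} VDC: {ideal_a:6.1f} A ideal, {derated_a:6.1f} A at 90% efficiency")
```

Even before conversion losses, the same load that draws about 11 A from a 270 VDC bus draws 250 A from a 12 VDC source, which has direct consequences for cabling, connectors, and protection devices.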
Power supply design must also account for the output voltages that high-end GPUs require. Leading boards supporting four- and eight-way GPUs with high-speed interconnectivity require 54V input. All of these considerations require divergence from traditional data center system design. The power requirements of transportable AI systems vary widely, and system designs need to be flexible and modular to adapt to the power source environment in which they reside.
Efforts to address transportable AI applications to date have constrained the performance delivered, with component selection traded down to lower-power, less sophisticated compute elements to meet the environmental challenges. Lower-capacity GPUs and CPUs require less power, generate less heat, and are easier to cool. The result is a limit on the capability of the AI applications deployed in the field. If these design challenges can be addressed, the full performance of today's leading-edge AI hardware can be unleashed at the edge, increasing the effectiveness of the next generation of military and industrial applications.
A case study in design for a high-performance, transportable AI system is the Rigel Edge Supercomputer recently announced by One Stop Systems (OSS). It is a small form factor (4U half rack) AI compute platform targeted at military applications on land, sea, and air as well as harsh-environment industrial applications including autonomous commercial trucking, mining, and construction equipment. It provides super-dense GPU compute capability with 4 NVIDIA A100 SXM4 GPUs with full mesh NVLink, the latest AMD EPYC CPU with PCIe Gen 4, up to 2TB of system memory, and 4 I/O slots of PCIe Gen 4 ×16, all intended to deliver full state-of-the-art AI performance.
In addition to providing high performance, Rigel is designed to operate at the harsh edge. It addresses vehicular operational vibration requirements through a lightweight milled aluminum chassis frame, custom ruggedized power supplies, add-in card restraining mechanisms, and strategically placed stiffening elements. It addresses the cooling requirements with a custom heat sink design based on CFD analysis that fully cools the GPUs and other critical components in ambient temperatures up to 35 °C at 10,000 ft altitude and 40 °C at sea level. It includes a customized three-tier airflow structure, with baffling and airflow segregation for optimal forced convection cooling. The power subsystem accommodates a total system capacity of 3000W fully redundant at a variety of power inputs including 120/240 VAC at 60Hz or three phase at 400-800Hz or 270/48 VDC. The power supplies provide fully redundant 54V and 12V output.
With new platforms like Rigel coming online, the full capabilities of rapidly advancing AI technology no longer need to be constrained to the data center but can be deployed where they are needed at the rugged edge.
This article was written by Tim Miller, Vice President, Product Marketing, One Stop Systems (Escondido, CA).