Using High-Performance Computing Clusters to Support Fine-Grained Parallel Applications

A custom-built serial board connects FPGAs to accelerate performance.

A heterogeneous cluster composed of host processors and field programmable gate arrays (FPGAs) was used to accelerate the performance of parallel fine-grained applications using a direct FPGA-to-FPGA communications channel. The communications channel is implemented with an all-to-all board that attaches directly to the FPGA boards via their I/O interface. Parallel Discrete Event Simulation (PDES) was used to demonstrate the acceleration performance.

Figure caption: Architecture of the individual HHPC node. The I/O card interconnects the FPGAs directly to each other through a custom-built all-to-all serial board, which provides concurrent connectivity from every node to every other node over dedicated serial lines.

PDES is an approach to parallelizing simulation to increase its performance and capacity, allowing the simulation of bigger, more detailed models, and more interesting scenarios in a given time. PDES underlies several areas of interest for the Department of Defense, including war games, planning and decision-making, and complex system design and analysis including both hardware and software systems.

In previous efforts to accelerate the performance of PDES, the communication subsystem was found to be a major bottleneck. Initial efforts to exploit the FPGAs on a Heterogeneous High-Performance Cluster (HHPC) to accelerate a PDES simulation were also reported. The goal of that study was to use the FPGA boards to accelerate some critical simulation subsystems. Because PDES is a fine-grained computation and communication with the FPGA board is expensive, however, it is almost impossible to use the FPGAs to optimize the simulation kernel.

In response to this limitation, an alternative channel was created that allows the FPGAs to communicate without interrupting the primary host processor. To achieve this, a serial all-to-all connector board was designed that provides direct, low-bandwidth, low-latency connectivity among the FPGA boards. This board gives the FPGAs a channel to communicate directly, potentially greatly improving the performance of fine-grained applications with components of the computation residing on the FPGAs.

To demonstrate such an application, the Global Virtual Time (GVT) computation was used as a target for FPGA implementation. Each node provides its local time and message counts to the FPGA board when it enters the GVT computation phase and whenever its in-transit message count changes. The boards communicate among themselves to track the global count of messages in transit. When that count reaches zero, they compute the minimum of the local times and broadcast it to all the host processors.
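
As an illustration only, the following Python sketch models that GVT logic in software; on the HHPC the logic resides on the FPGAs and communicates over the all-to-all board, and all names and values here are hypothetical.

    # Software model of the GVT computation described above. Each node
    # reports its local virtual time plus counts of messages sent and
    # received; messages still in transit = sum(sent) - sum(received).
    # When the global in-transit count reaches zero, GVT is the minimum
    # of the reported local times. (Hypothetical sketch; on the HHPC this
    # logic runs on the FPGAs and communicates over the all-to-all board.)

    def compute_gvt(reports):
        """reports: list of (local_time, msgs_sent, msgs_received), one per node."""
        in_transit = (sum(sent for _, sent, _ in reports)
                      - sum(recv for _, _, recv in reports))
        if in_transit != 0:
            return None  # messages still in flight; GVT is not yet decided
        return min(local_time for local_time, _, _ in reports)

    # Example: three nodes, every sent message has been received.
    reports = [(120.5, 10, 9), (118.0, 7, 8), (125.2, 4, 4)]
    print(compute_gvt(reports))  # -> 118.0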

The all-to-all board was tested for functionality and performance to set the baseline physical rate at which it can communicate. Further, support for communication using the all-to-all board had to be developed: the equivalent of a link layer for this communication channel.
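
The brief does not specify that link layer's design, but a minimal framing scheme of the kind required might resemble the following sketch; the frame layout, constants, and function names are assumptions for illustration.

    # Minimal link-layer framing sketch (hypothetical; the layer actually
    # developed for the all-to-all board is not described in this brief).
    # Frame layout: START byte | payload length | payload | 8-bit checksum.

    START = 0x7E

    def frame(payload: bytes) -> bytes:
        checksum = sum(payload) & 0xFF
        return bytes([START, len(payload)]) + payload + bytes([checksum])

    def deframe(data: bytes) -> bytes:
        if len(data) < 3 or data[0] != START:
            raise ValueError("bad frame start")
        length = data[1]
        if len(data) != length + 3:
            raise ValueError("truncated frame")
        payload = data[2:2 + length]
        if (sum(payload) & 0xFF) != data[2 + length]:
            raise ValueError("checksum mismatch")  # corrupted in transit
        return payload

    msg = frame(b"\x01\x02\x03")
    assert deframe(msg) == b"\x01\x02\x03"

A checksum of this kind lets the receiver detect transfers corrupted on the serial lines, which matters at the data rates where cross-talk appears (see the test results in the Overview below).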

The HHPC is a Beowulf cluster made of off-the-shelf PCs (featuring dual Intel Xeon processors) interconnected via a Gigabit Ethernet network and a Myrinet network. In addition, each node has an Annapolis Micro Systems Wildstar II FPGA board on the PCI bus. The Wildstar has a Xilinx Virtex-II FPGA, some DRAM and SRAM banks, and an LVDS I/O card. The I/O card was used to interconnect the FPGAs directly to each other using a custom-built all-to-all serial board. This board provides connectivity from every node to every other node concurrently using a dedicated serial line. This results in a low-latency but low-bandwidth communication channel among the FPGAs.
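
One consequence of dedicating a serial line to every pair of nodes is that the wire count grows quadratically with cluster size, which is consistent with the board's large physical size and numerous wires noted in the Overview below. The short sketch here computes the line count for a few hypothetical cluster sizes; the HHPC's actual node count is not given in this brief.

    # Dedicated point-to-point serial lines needed for full all-to-all
    # connectivity among n FPGA boards: one line per unordered pair of
    # nodes, i.e. n*(n-1)/2 lines in total.

    def all_to_all_links(n: int) -> int:
        return n * (n - 1) // 2

    for n in (8, 16, 32):
        print(n, "nodes ->", all_to_all_links(n), "serial lines")  # 28, 120, 496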

Without this connectivity, all communication must go through the communication fabric, at a latency ranging from about 10 microseconds (for Myrinet) to several tens of microseconds (for Gigabit Ethernet). Typically, FPGA boards are used to accelerate sequential or high-granularity parallel applications that have high data parallelism or unusual data paths. PDES does not fit this profile: it is fine-grained and does not, in general, require high data parallelism.

This work was done by Nael Abu-Ghazaleh of the State University of New York – Binghamton for the Air Force Research Laboratory.

AFRL-0118



This Brief includes a Technical Support Package (TSP). "Using High-Performance Computing Clusters to Support Fine-Grained Parallel Applications" (reference AFRL-0118) is currently available for download from the TSP library.




This article first appeared in the April 2009 issue of Defense Tech Briefs Magazine (Vol. 3, No. 2).



Overview

The document titled "Using Heterogeneous High Performance Computing Cluster for Supporting Fine-Grained Parallel Applications" is a final technical report published by the Air Force Research Laboratory in October 2006. The primary objective of the project was to investigate a new infrastructure designed to support basic communication for parallel applications, specifically through the lens of Parallel Discrete Event Simulation (PDES).

The project was initiated with the fabrication of an all-to-all communication board, which became available a few months into the research. A graduate research assistant, David Curren, was employed to work on the project, dedicating 20 hours a week, with additional support during the summer. The initial phase of the project focused on training the research assistant in various relevant technologies, including VHDL, physical design processes, and the PDES simulator.

A significant part of the project involved developing a test methodology and test scripts for the all-to-all board, which is physically large and carries numerous wires. The design anticipated cross-talk issues and allowed tolerances up to 200 MHz. Testing, however, revealed that cross-talk occurred at data rates above 51 MHz, with some sparse patterns functioning up to 80 MHz. Ultimately, continuous communication with randomly generated patterns across all wires could be sustained only at 51 MHz or lower.
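
A software model of that test methodology might look like the following sketch, in which random patterns are driven across all wires at once and each received word is compared against what was sent. The hook functions and wire count are assumptions, since the brief does not describe the test scripts themselves; the real tests exercised the physical board.

    # Software model of the board-test methodology described above: drive
    # randomly generated patterns across all wires simultaneously and
    # compare what each receiver sees against what was sent.
    # (Hypothetical sketch; send_fn/recv_fn stand in for hardware access.)

    import random

    def run_pattern_test(send_fn, recv_fn, n_wires=16, n_rounds=1000):
        errors = 0
        for _ in range(n_rounds):
            pattern = [random.randint(0, 1) for _ in range(n_wires)]
            send_fn(pattern)            # drive the pattern onto all wires
            if recv_fn() != pattern:    # any mismatch suggests cross-talk
                errors += 1
        return errors

In the reported measurements, sustained random patterns of this kind passed only at clock rates of 51 MHz or below.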

The report also delves into the background of PDES, explaining its operational mechanics: events are held in a queue and processed in simulation-time order, the simulation clock advancing as each event is handled, until no events remain or a predetermined end time is reached. The document emphasizes the need for more effective fine-grained communication and self-monitoring within heterogeneous high-performance computing (HHPC) environments.
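
That queue-driven event loop, the sequential core that PDES parallelizes, can be illustrated with a short sketch; the toy event class and function names here are hypothetical.

    # Sketch of the sequential discrete-event loop described above: events
    # sit in a priority queue keyed by simulation time, and the simulation
    # ends when the queue empties or a predetermined end time is reached.

    import heapq
    import itertools

    _tie = itertools.count()  # tie-breaker so equal-time events never compare

    class Ping:
        """Toy event that reschedules itself every 10 time units."""
        def handle(self, now):
            print(f"ping at t={now}")
            return [(now + 10.0, Ping())]

    def simulate(initial_events, end_time):
        queue = [(t, next(_tie), e) for t, e in initial_events]
        heapq.heapify(queue)
        while queue:
            now, _, event = heapq.heappop(queue)
            if now > end_time:              # predetermined stop time reached
                break
            for t, e in event.handle(now):  # handlers schedule future events
                heapq.heappush(queue, (t, next(_tie), e))

    simulate([(0.0, Ping())], end_time=30.0)  # prints pings at t=0, 10, 20, 30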

In summary, this report outlines the challenges and findings related to the development of communication support for parallel applications, highlighting the importance of addressing cross-talk issues in high-speed data transmission. It serves as a foundational study for future application developers looking to leverage the capabilities of advanced computing infrastructures for fine-grained parallel processing.