Dataflow Engines

The spectrum of computation options has at one end application-specific hardware (ASICs) and at the other end software running on general purpose CPUs. Reconfigurable hardware provides the performance and power consumption benefits usually associated with a hardware system while keeping the flexibility to change the logic to accomplish different tasks. FPGAs are an example of reconfigurable hardware; they are suitable for computation problems due to their large number of resources, variety of logic blocks, reliability and performance.

An FPGA alone is not enough to provide a self-contained and reusable computation resource. It requires logic to connect the device to the host, RAM for bulk storage, interfaces to other buses and interconnects, and circuitry to service the device. This complete system is called a Dataflow Engine (DFE) [1]. The DFE contains an FPGA as the computation fabric and can be easily integrated with a host system or shared with more than one host system.

A DFE operates on streams of data. Data is streamed from memory into the computation fabric, where operations are performed on the stream as it flows through, with the data forwarded spatially from one functional unit to the next. Results of the computation are streamed back to memory.
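The streaming model can be illustrated in software with chained generators. This is only an analogy, not DFE code: the functional units `scale` and `offset` and the driver `run_stream` are hypothetical names chosen for this sketch, and a Python list stands in for memory.

```python
from typing import Iterable, Iterator, List


def scale(stream: Iterable[float], factor: float) -> Iterator[float]:
    """Hypothetical functional unit: multiply each element by a constant."""
    for x in stream:
        yield x * factor


def offset(stream: Iterable[float], bias: float) -> Iterator[float]:
    """Hypothetical functional unit: add a constant to each element."""
    for x in stream:
        yield x + bias


def run_stream(data: List[float], factor: float = 2.0, bias: float = 1.0) -> List[float]:
    """Stream data out of 'memory', through two chained units, and back."""
    # Each element is handed from one unit to the next as soon as it is
    # produced; no intermediate array is ever materialised.
    return list(offset(scale(data, factor), bias))


if __name__ == "__main__":
    print(run_stream([1.0, 2.0, 3.0]))
```

In hardware the two units would exist side by side and work on different elements in the same clock cycle; the generator chain only mimics the element-by-element hand-off.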

Approach

A programmer wishing to use DFEs should first identify the computational bottlenecks that can benefit from the performance improvement delivered by DFEs. The relevant parts can then be described using a dataflow model [2], and the new dataflow implementation is converted into hardware data paths, or DFE Kernels. In the dataflow model the programmer thinks of the implementation as a graph with data flowing through operators. Operations on independent streams within the implementation can be performed in parallel, which reduces the compute latency. All dependencies are resolved statically at compile time, so the resulting design can be highly parallel and deeply pipelined.
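As a rough illustration of the dataflow model, the sketch below describes the expression (a + b) * (a - b) as a graph of operators and derives a static schedule from its dependencies. It is a software analogy under stated assumptions, not a DFE Kernel: the graph layout and the names `GRAPH`, `static_schedule`, `evaluate` and `run_kernel` are hypothetical, and Python's standard `graphlib` module (Python 3.9 or later) stands in for the compile-time dependency analysis.

```python
import operator
from graphlib import TopologicalSorter

# Hypothetical dataflow graph: node name -> (operator, names of its inputs).
# "a" and "b" are the input streams; "out" is the result stream.
GRAPH = {
    "sum": (operator.add, ("a", "b")),
    "diff": (operator.sub, ("a", "b")),
    "out": (operator.mul, ("sum", "diff")),
}


def static_schedule(graph):
    """Group nodes into levels; nodes in the same level are independent."""
    deps = {
        node: {name for name in inputs if name in graph}
        for node, (_, inputs) in graph.items()
    }
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all nodes whose inputs are available
        levels.append(ready)
        sorter.done(*ready)
    return levels


def evaluate(graph, levels, inputs):
    """Push one element of each input stream through the graph."""
    values = dict(inputs)
    for level in levels:
        for node in level:  # in hardware, one level's nodes run in parallel
            op, args = graph[node]
            values[node] = op(*(values[name] for name in args))
    return values["out"]


def run_kernel(stream_a, stream_b):
    """Apply the graph to two input streams, element by element."""
    levels = static_schedule(GRAPH)  # resolved once, before any data flows
    return [
        evaluate(GRAPH, levels, {"a": a, "b": b})
        for a, b in zip(stream_a, stream_b)
    ]


if __name__ == "__main__":
    print(static_schedule(GRAPH))
    print(run_kernel([3, 5], [1, 2]))
```

The number of levels corresponds loosely to pipeline depth, and the width of a level to the parallelism available at that stage. Here the addition and subtraction share a level because neither depends on the other.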

Together with the DFE dataflow configuration, the programmer has to provide run-time software for the host system that will reconfigure the hardware and initiate the transfer of data to the DFE. The OpenSPL consortium is addressing the issues of programming DFEs along with other spatial computing devices.

DFE resource virtualisation and elasticity

One challenge of a cloud platform is to utilise resources in a manner that will both satisfy client requests and give the service provider flexibility. One way a cloud infrastructure can provide DFEs is by using a Maxeler MPC-X system, which is a network-connected unit containing eight DFEs. These devices are individually available to any host and job on the cloud network. However, a static allocation of DFEs from the MPC-X to a job is not very efficient.

Virtualisation of Maxeler DFEs to support resource sharing and elasticity

For this reason the DFE resources are virtualized behind an interface called a group. A group of DFEs is a collection of engines configured for a particular task; after configuration the engines are all identical. On a single MPC-X system several groups can be in use concurrently, with each DFE assigned to one of the groups. To the host the group can be treated as a single DFE, though tasks submitted to that group could run on any DFE within it. On-device scheduling manages the DFEs and tasks to maximize throughput. Using a group removes the need for the host to make complex scheduling decisions, and because the group makes its scheduling decisions at the hardware, the pipelining of tasks to the DFEs is optimized.
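The effect of a group can be sketched with a small simulation. The text does not describe the scheduling policy used on the MPC-X, so the "earliest free engine" rule below is an assumption made for illustration, and `DFEGroup`, `submit` and `makespan` are hypothetical names rather than part of any Maxeler API.

```python
import heapq


class DFEGroup:
    """Hypothetical model of a group: identical engines behind one interface."""

    def __init__(self, name, num_dfes):
        self.name = name
        # Min-heap of (time at which the engine becomes free, engine id).
        self._free_at = [(0.0, engine) for engine in range(num_dfes)]
        heapq.heapify(self._free_at)

    def submit(self, duration, arrival=0.0):
        """Run a task on whichever engine frees up first.

        The host never names an engine. Returns (engine id, finish time).
        """
        free_time, engine = heapq.heappop(self._free_at)
        start = max(free_time, arrival)
        finish = start + duration
        heapq.heappush(self._free_at, (finish, engine))
        return engine, finish


def makespan(num_dfes, durations):
    """Time to complete all tasks when submitted to a group of the given size."""
    group = DFEGroup("demo", num_dfes)
    return max(group.submit(d)[1] for d in durations)


if __name__ == "__main__":
    tasks = [2, 2, 2, 2]
    print(makespan(1, tasks), makespan(2, tasks))
```

The host-side code is the same whether the group holds one engine or several; only the completion time changes, which is the point of treating the group as a single DFE.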

Furthermore, a group can accumulate or shed DFEs as the utilization of the groups on the MPC-X changes over time. This dynamic behavior means a group can seamlessly increase in size, and hence in throughput, by adding unused or under-utilized DFEs. A DFE joining a group is reconfigured to match the other DFEs in that group. It is this dynamic behavior of groups that provides elasticity of compute resources. This is of particular benefit to the platform provider because resources are placed where they are most needed and are used when they would otherwise be idle. Scheduling within the group and governing the group size are both performed at the MPC-X, which reduces the load on the cloud management.
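One possible shape for such elasticity is sketched below. The actual policy that governs group size on the MPC-X is not given here, so this is purely an assumed rule: move one engine at a time from the group with the least pending work per DFE to the group with the most. The function name `rebalance_step` and the group names are hypothetical.

```python
def rebalance_step(dfes, pending):
    """Move one DFE from the least-loaded group to the most-loaded group.

    dfes:    group name -> number of DFEs currently in the group
    pending: group name -> number of queued tasks for the group
    Every group keeps at least one DFE. Returns a new allocation dict.
    """
    load = {group: pending[group] / dfes[group] for group in dfes}
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    if busiest == idlest or dfes[idlest] <= 1:
        return dict(dfes)  # nothing to move
    new_allocation = dict(dfes)
    new_allocation[idlest] -= 1   # engine leaves the under-utilized group
    new_allocation[busiest] += 1  # and is reconfigured to match the busy group
    return new_allocation


if __name__ == "__main__":
    allocation = {"pricing": 4, "risk": 4}  # eight DFEs on one MPC-X
    queue = {"pricing": 40, "risk": 4}
    print(rebalance_step(allocation, queue))
```

A real policy would also need to weigh the cost of reconfiguring an engine against the expected gain and avoid moving engines back and forth between groups; this sketch ignores both.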

Further Reading

  1. O. Pell, O. Mencer, K. H. Tsoi, and W. Luk. Maximum performance computing with dataflow engines. In High-Performance Computing Using FPGAs, W. Vanderbauwhede and K. Benkrid, Eds. Springer-Verlag, 2013.
  2. A. DeHon. Compute models and system architectures. In Reconfigurable Computing, pp. 91–127. Morgan Kaufmann, 2008.