Aspect-Oriented Designs

Developers must acquire considerable knowledge and expertise to effectively program heterogeneous platforms. Heterogeneous platforms, in addition to multi-core CPUs, may include an arbitrary number of specialised resources, such as FPGAs and GPGPUs. In the context of heterogeneous platforms, developers must be aware of a number of architectural details including: the different types of processing cores which may exhibit various levels of complexity, the communication topology between processing elements, the hierarchy between different memory systems, and built-in specialised architectural features. All these elements influence how programs are developed and deployed:

  • Hardware/software partitioning - selecting parts of the code that are better suited for different processing elements;
  • Locality - maximising data reuse;
  • Layout - keeping data close to computation;
  • Copying - minimising data movement;
  • Exploiting architecture features - (a) independent computational units, and (b) specialised functional units and instructions.

There are two common programming approaches that address heterogeneity: (1) a uniform programming framework supporting a single programming language and semantics to target different types of computational resources, such as CPUs, GPGPUs and FPGAs; (2) a hybrid programming framework in which developers manually partition and map parts of their application to different types of processing elements using the most suitable semantics and languages.

In HARNESS, we wish to attain the programming benefits of the above two approaches:

  • Design Goal 1: support a single programming language to allow applications to be implemented, optimised and mapped in a uniform, target-agnostic manner;
  • Design Goal 2: support multiple semantics so that developers can express alternative descriptions of an algorithm that can be better mapped and optimised to architectures that support, for instance, fine-grained and data-parallel computations;
  • Design Goal 3: minimise application knowledge about platform architectural details.

Approach

In HARNESS, we are developing the Uniform Heterogeneous Programming approach, which aims to support the above design goals using two complementary languages: FAST [3] and LARA [1,4].

With FAST, developers (cloud tenants) use a single software language (C99) to implement their applications, with the option of using multiple semantics to describe alternative versions of the same algorithm. In the figure below, we present two functions that implement the moving average using imperative and dataflow semantics [4], respectively. With dataflow semantics, C99 code is translated into functional units that are mapped onto the FPGA area to realise a deeply pipelined architecture, in which data is computed in parallel and each output is forwarded synchronously to the next functional unit. We believe FAST simplifies both the compilation and optimisation design-flow, by using a single code base, and the programming effort when targeting specialised computation resources. However, programmers are still required to understand the intricacies of the hardware architecture to optimise their designs. To reduce this learning curve, we adopt the LARA aspect-oriented programming methodology, which we explain next.

FAST programming semantics: imperative vs dataflow
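
As a rough illustration of the two styles, the sketch below writes a moving average twice in plain C99. It is not FAST syntax: the three-point window, the function names and the software harness that emulates a stream are all assumptions made for this example. The point is only that the imperative version owns the loop, while the dataflow-style body describes the computation for a single stream element using offsets relative to the current position.

```c
#include <assert.h>
#include <stddef.h>

/* Imperative semantics: an explicit loop walks the whole array.
 * Assumption for this sketch: a three-point window, with the two
 * boundary elements of `out` left untouched. */
void moving_avg_imperative(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}

/* Dataflow-style body: describes what happens to ONE stream element.
 * `cur` points at the current stream position; cur[-1] and cur[1]
 * play the role of stream offsets. The body contains no loop: on a
 * dataflow engine it would become a pipeline fed one element per cycle. */
static float moving_avg_kernel(const float *cur)
{
    return (cur[-1] + cur[0] + cur[1]) / 3.0f;
}

/* Hypothetical software harness that emulates the stream by handing
 * the kernel one element per "tick". */
void moving_avg_dataflow_emulated(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t tick = 1; tick + 1 < n; tick++) {
        out[tick] = moving_avg_kernel(&in[tick]);
    }
}
```

Both functions produce the same values. The difference lies in what the programmer states: the second form leaves iteration to the surrounding infrastructure, which is what makes a pipelined hardware mapping natural.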

With LARA, hardware infrastructure experts (for instance, working on behalf of cloud providers) can codify domain specific knowledge into special programs called aspects which analyse and manipulate (naive) FAST programs. In particular, a process called weaving combines, in an automated fashion, non-functional (aspects) and functional concerns (FAST programs) leading to the desired implementation. There are several benefits to the weaving process as pursued by a LARA-guided design-flow (see figure below):

  1. It allows non-functional concerns to be developed and maintained independently from the original application source code. This decoupling promotes a clean separation between the algorithmic specification (FAST programs) and non-functional description (LARA aspects) leading to a cleaner and thus easier to maintain source code base;
  2. LARA aspects can be developed independently from the application sources and therefore reused across multiple applications. This reuse allows non-expert developers to exploit specialised transformations and strategies geared towards specific target architectures, thus substantially promoting productivity and portability across similar target architectures;
  3. The ability to specify generic and parameterisable aspects in LARA is particularly useful for describing hardware- and software-based transformation patterns as well as templates, thus facilitating design-space exploration.
Benefits of the LARA aspect-oriented design-flow

Targeting Dataflow Engines

We implemented an initial prototype of the uniform heterogeneous programming tools targeting dataflow engines [2]. Our toolchain (see figure below) includes: (1) the FAST compiler, which translates FAST functions implemented with dataflow semantics into MaxJ code [5]; (2) a LARA weaver, which analyses and automatically manipulates FAST programs based on aspects; and (3) a repository of aspects that address specific concerns of dataflow computing. We report four types of aspect concerns for dataflow computing that are being applied to FAST programs:

  1. system aspects capture transformations and optimisation strategies that affect the whole application, such as hardware/software partitioning and run-time reconfiguration;
  2. architectural aspects focus on low-level design optimisations that can be applied to designs in FAST to improve timing, resource usage or exploit specialised architectural features;
  3. exploration aspects deal with strategies that generate multiple designs to find an optimal implementation based on user-specified constraints;
  4. development aspects relate to concerns that have an impact on the design process, such as debugging, kernel simulation and improving compilation speed.

Extending the DFE design flow with FAST and LARA

Examples

We describe two examples, one for monitoring and one for operator mapping.

A Monitoring example

To find potential hotspots in the application, we can use the LARA aspect below. With this aspect, the weaver can automatically instrument any C application to self-monitor its innermost loops at run-time, as they are natural candidates for dataflow-based acceleration. In particular, this monitoring aspect can compute the following information for every innermost loop:

  1. the average number of times it has been executed
  2. the average number of iterations
  3. the loop average execution time
  4. the loop iteration average execution time

For this purpose, we use a simple monitoring API to record how often each loop executes, together with its execution time and per-iteration time:
  • monitor_instanceI and monitor_instanceE: mark the execution before and after a loop respectively
  • monitor_iterI and monitor_iterE: mark the execution beginning and end of a loop iteration respectively

Monitoring Aspect
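
To make the effect of weaving concrete, the following C sketch shows what an instrumented innermost loop could look like, together with a toy implementation of the four monitoring calls. Only the four function names come from the text above. Their signatures (a single integer loop identifier), the statistics table, the helper functions and the use of `clock()` for timing are assumptions made for this example, not the actual HARNESS monitoring library.

```c
#include <assert.h>
#include <time.h>

#define MAX_LOOPS 64  /* hypothetical capacity of the statistics table */

typedef struct {
    unsigned long instances;   /* how many times the loop was entered */
    unsigned long iterations;  /* total iterations over all instances */
    clock_t instance_start, iter_start;
    clock_t instance_ticks, iter_ticks;  /* accumulated CPU time */
} loop_stats;

static loop_stats stats[MAX_LOOPS];

/* Assumed signatures: each call takes the identifier of the loop. */
void monitor_instanceI(int id)
{
    assert(id >= 0 && id < MAX_LOOPS);
    stats[id].instances++;
    stats[id].instance_start = clock();
}
void monitor_instanceE(int id)
{
    stats[id].instance_ticks += clock() - stats[id].instance_start;
}
void monitor_iterI(int id)
{
    stats[id].iterations++;
    stats[id].iter_start = clock();
}
void monitor_iterE(int id)
{
    stats[id].iter_ticks += clock() - stats[id].iter_start;
}

/* Hypothetical helpers deriving the reported figures. */
unsigned long monitor_instances(int id) { return stats[id].instances; }
double avg_iterations(int id)
{
    return stats[id].instances
         ? (double)stats[id].iterations / (double)stats[id].instances : 0.0;
}
double avg_loop_seconds(int id)
{
    return stats[id].instances
         ? (double)stats[id].instance_ticks / CLOCKS_PER_SEC
           / (double)stats[id].instances : 0.0;
}
double avg_iter_seconds(int id)
{
    return stats[id].iterations
         ? (double)stats[id].iter_ticks / CLOCKS_PER_SEC
           / (double)stats[id].iterations : 0.0;
}

/* An innermost loop as it might look after weaving (loop id 0). */
long sum_squares(int n)
{
    long acc = 0;
    monitor_instanceI(0);
    for (int i = 0; i < n; i++) {
        monitor_iterI(0);
        acc += (long)i * i;
        monitor_iterE(0);
    }
    monitor_instanceE(0);
    return acc;
}
```

Note that the original source of `sum_squares` would contain none of the monitoring calls. Under the aspect-oriented approach they are inserted by the weaver, so the application code stays free of instrumentation.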

Operator mapping

To provide architectural details to FAST designs, such as mapping operators to DSP blocks, we can use the FAST pragma, shown below, at the top of a statement or code block. This FAST balancing pragma provides fine-grained control over the mapping of computation to either DSPs or LUT/FF pairs:

#pragma FAST balanceDSP:balanced
{
   x = x * y;
   x++;
}

The balancing parameter corresponds to the degree of utilisation of DSP blocks in a statement. The aspect shown below is the strategy for balancing DSP blocks in every statement of an application. Instead of adding the above pragma manually, we provide a set of rules that define where to place the balanceDSP pragma. In this example, the rule applies full DSP block utilisation to any statement that has 5 or more multipliers and adders, balanced utilisation to any statement that has 3 or more multipliers, and no DSP utilisation otherwise.

DSP balancing aspect
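
The decision rule itself is simple enough to sketch in C. The function below is not LARA code and not part of the HARNESS toolchain. It only restates the rule described above, under two assumptions: that "5 or more multipliers and adders" refers to the combined operator count, and that the pragma values for the two other modes are spelled `full` and `none` (only `balanced` appears in the text).

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum dsp_mode { DSP_NONE, DSP_BALANCED, DSP_FULL };

/* Classify one statement from its operator counts.
 * Assumption: the "5 or more" threshold applies to multipliers and
 * adders counted together. */
enum dsp_mode dsp_balance_mode(int multipliers, int adders)
{
    assert(multipliers >= 0 && adders >= 0);
    if (multipliers + adders >= 5) {
        return DSP_FULL;       /* full DSP block utilisation */
    }
    if (multipliers >= 3) {
        return DSP_BALANCED;   /* split between DSPs and LUT/FF pairs */
    }
    return DSP_NONE;           /* implement in LUT/FF pairs only */
}

/* Pragma value for a mode; "full" and "none" are hypothetical names. */
const char *dsp_mode_name(enum dsp_mode mode)
{
    switch (mode) {
    case DSP_FULL:     return "full";
    case DSP_BALANCED: return "balanced";
    default:           return "none";
    }
}

/* Write the pragma line a weaver would place before the statement.
 * Returns the value of snprintf. */
int emit_pragma(char *buf, size_t size, enum dsp_mode mode)
{
    return snprintf(buf, size, "#pragma FAST balanceDSP:%s",
                    dsp_mode_name(mode));
}
```

Expressing the thresholds as one function mirrors the benefit of the aspect: changing the strategy means editing one rule rather than revisiting every annotated statement.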

Evaluation

The DSP balancing aspect, shown above, makes it possible to explore the resource trade-offs of implementing arithmetic operations in either DSPs or LUTs and FFs, and helps to avoid over-mapping onto DSPs in arithmetic-intensive applications. Below, we show the effects of using DSPs in different RTM designs, varying the parallelism and numerical precision. In particular, fully utilising DSPs reduces LUT/FF area as well as place-and-route time, but may limit parallelism.

RTM design results

Further Reading

  1. J. M. P. Cardoso, T. Carvalho, J. G. F. Coutinho, W. Luk, R. Nobre, P. Diniz, and Z. Petrov. LARA: an aspect-oriented programming language for embedded systems. In 11th ACM Conference on Aspect-Oriented Software Development (AOSD), 2012.
  2. O. Pell, O. Mencer, K. H. Tsoi, and W. Luk. Maximum performance computing with dataflow engines. In High-Performance Computing Using FPGAs, W. Vanderbauwhede and K. Benkrid, Eds. Springer-Verlag, 2013.
  3. P. Grigoras, X. Niu, J. G. F. Coutinho, W. Luk, J. Bower, and O. Pell. Aspect driven compilation for dataflow designs. In 24th IEEE Conference on Application-Specific Systems, Architectures and Processors (ASAP), 2013.
  4. J. M. P. Cardoso, J. G. F. Coutinho, T. Carvalho, P. Diniz, Z. Petrov, W. Luk, and F. Goncalves. Performance-driven instrumentation and mapping strategies using the LARA aspect-oriented programming approach. Software: Practice and Experience, December 2014.
  5. The OpenSPL consortium