Contact: Dr Yiren Zhao ([email protected]), Dr Jianyi Cheng ([email protected])
A major drawback of FPGAs is that regular operations, such as matrix multiplications (MMs), are already well optimized on other devices such as GPUs and vector processors. Recently, there has been interest in exploring accelerator architectures on AMD Versal ACAP devices. A Versal device contains an array of vector processors, named AI Engines (AIEs). These AIEs are specialized for accelerating SIMD operations, such as MMs at a given set of precisions, and are connected by reconfigurable switches that support custom data parallelism. Recent work has shown significant performance benefits when accelerating transformer models on Versal devices, compared to FPGAs and GPUs. However, AIEs are often hard to program due to the large design space spanning spatial memory mapping and data parallelism.
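To make the mapping problem concrete, here is a minimal sketch, in plain Python with illustrative tile sizes, of how an MM can be partitioned into tiles and distributed over a 2-D array of processing elements such as AIEs. Each (ti, tj) tile corresponds to one PE in the spatial array; the function and tile size are hypothetical, not an actual AIE programming interface.

```python
def tiled_matmul(A, B, tile=2):
    """Multiply an MxK matrix A by a KxN matrix B tile by tile.

    Each (ti, tj) output tile models the work of one processing element;
    the tk loop models the reduction that PE performs over streamed-in
    tiles of A and B. Choosing how tiles map to PEs and memories is
    exactly the design space that makes AIEs hard to program by hand.
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for ti in range(0, M, tile):
        for tj in range(0, N, tile):
            for tk in range(0, K, tile):
                for i in range(ti, min(ti + tile, M)):
                    for j in range(tj, min(tj + tile, N)):
                        for k in range(tk, min(tk + tile, K)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```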
Triton is a language and compiler for parallel programming. It provides a Python-based programming environment for productively writing custom DNN compute kernels that run at maximal throughput on modern GPU hardware. This project aims to reduce the programming burden for AIEs by targeting them from the Triton interface.
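To illustrate the programming model the project would build on: a Triton kernel is a "block program", where each program instance processes one tile of the data and a grid of instances covers the whole tensor. The sketch below emulates that model in plain Python; the real Triton API instead uses `@triton.jit`, `tl.program_id`, `tl.load`, and `tl.store`, so the names and block size here are purely illustrative.

```python
import math

BLOCK = 4  # illustrative block size; real Triton kernels choose this at launch

def add_kernel(x, y, out, n, pid):
    """One 'program instance': adds one BLOCK-sized tile of the inputs,
    as a Triton kernel body would via masked tl.load/tl.store."""
    start = pid * BLOCK
    for off in range(start, min(start + BLOCK, n)):  # mask the ragged tail
        out[off] = x[off] + y[off]

def launch_add(x, y):
    """Launch a 1-D grid of program instances over the inputs,
    mirroring Triton's grid of cdiv(n, BLOCK) programs."""
    n = len(x)
    out = [0] * n
    for pid in range(math.ceil(n / BLOCK)):
        add_kernel(x, y, out, n, pid)
    return out
```

Mapping this block-program abstraction onto AIE tiles and their interconnect, rather than GPU thread blocks, is the core of the proposed work.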
Note: This project would be challenging and would require good software engineering skills.
https://triton-lang.org/main/index.html
https://github.com/Xilinx/mlir-air
The project would best suit a student who: