Compilers for Machine Learning

5th C4ML workshop, at CGO 2024

Sunday, March 3, 2024

In person: Edinburgh, UK


Machine learning applications are becoming ubiquitous in large-scale production systems. With that growth and the scaling in data volume and model complexity, the focus on efficiently executing machine learning models has become even greater. The push for increased energy efficiency has led to the emergence of diverse heterogeneous system and accelerator architectures. In parallel, model complexity and diversity have pushed for higher-productivity systems, more powerful programming abstractions, type systems, language embeddings, frameworks and libraries. Compilers have historically been the bridge between programmer efficiency and high-performance code, allowing the expression of code that remains understandable and productive to port and extend, while producing high-performance code for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. This goes both ways: given the broadening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks have also emerged as natural clients of machine learning techniques, from domain-specific heuristics to autotuning.

This workshop aims to highlight cutting-edge work and research that applies compiler techniques and algorithms to optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack. The workshop topics span from high-level abstract representations to code generation for accelerators. The invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.


The workshop features 8 presentations from leading ML compiler experts from industry and academia. 7 posters will be displayed at the end of the workshop (together with the main conference's welcome and poster reception), with short talks introducing the posters in the last session.

Venue: Edinburgh International Conference Centre (EICC).
Room: Carrick 1, 2.

09:15-09:20 - Opening

09:20-10:00 - Session 1 - Debunking ML for Compilers

10:00-10:20 - Break

10:20-12:20 - Session 2 - ML Compiler Construction

12:20-13:20 - Lunch

13:20-15:20 - Session 3 - Target- and domain-specific optimization

15:20-15:40 - Break 

15:40-16:20 - Session 4 - ML compiler infrastructure for general-purpose computing

16:20-17:20 - Session 5 - Poster Lightning Talks

18:00-20:00 - Poster Reception


The traditional model of structuring compilers into a series of discrete, sequential passes has stood as a foundational principle since the earliest days of computing. However, the focus on heterogeneous target architectures and evolving compute loads demands more dynamic structures capable of adapting code generation strategies based on the structure of the program to be compiled.

In response, solutions such as dynamic pass pipelines and various approaches to scheduling APIs have emerged to drive compilers more flexibly.

In this talk we take a brief look at the design considerations for passes in the past and examine abstractions we developed on top of that to meet the requirements of today. We identify shortcomings of our current approaches and offer a glimpse into the steps we are taking towards controlling compilers more flexibly with formal analysis and automation in mind.
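The contrast between a fixed pass sequence and a pipeline that adapts to the program can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy list-of-op-names "IR", the pass names `fuse_elementwise` and `tile_matmul`, and the `run_static`/`run_dynamic` drivers are all hypothetical and are not taken from the talk or from any real compiler.

```python
from typing import Callable, List, Tuple

# Hypothetical toy IR: a program is just a list of operation names.
Program = List[str]
Pass = Callable[[Program], Program]

ELEMENTWISE = ("add", "mul")


def fuse_elementwise(prog: Program) -> Program:
    """Replace each run of adjacent elementwise ops with a single 'fused' op."""
    out: Program = []
    for op in prog:
        if op in ELEMENTWISE:
            if out and out[-1] == "fused":
                continue  # extend the fused group that is already open
            out.append("fused")
        else:
            out.append(op)
    return out


def tile_matmul(prog: Program) -> Program:
    """Mark every matmul as tiled."""
    return ["tiled_matmul" if op == "matmul" else op for op in prog]


def run_static(prog: Program, passes: List[Pass]) -> Program:
    """Classic pipeline: run every pass, in a fixed order, on every program."""
    for p in passes:
        prog = p(prog)
    return prog


def run_dynamic(prog: Program) -> Tuple[Program, List[str]]:
    """Dynamic pipeline: inspect the program and schedule only relevant passes."""
    applied: List[str] = []
    if "matmul" in prog:
        prog = tile_matmul(prog)
        applied.append("tile_matmul")
    if any(op in ELEMENTWISE for op in prog):
        prog = fuse_elementwise(prog)
        applied.append("fuse_elementwise")
    return prog, applied


if __name__ == "__main__":
    print(run_static(["matmul", "add", "mul", "relu"], [tile_matmul, fuse_elementwise]))
    print(run_dynamic(["relu"]))
```

In the static driver the pass list is decided before the program is seen; in the dynamic driver the decision depends on the program's structure, which is the property the abstract argues modern targets require.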

This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware-friendly tile calls; (3) A mechanism for micro-kernel lowering to an open-source library that supports various CPUs.
We have contributed substantially to the upstream Linalg and Tensor dialects, packing, tile and fuse passes, which is the main topic of this paper. Our goal is to continue contributing upstream and to gather the MLIR community around the same goal, so that we can make forward progress upstream, avoiding the not-invented-here syndrome of far too much downstream work.
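The general idea behind packing, tiling and micro-kernel lowering can be illustrated with a small pure-Python sketch. It is not the MLIR implementation described in the abstract: the function names (`pack`, `micro_kernel`, `matmul_packed`), the nested-list tile layout and the tile size of 2 are assumptions chosen only to keep the example short and runnable.

```python
def matmul_naive(a, b):
    """Reference matrix multiply on lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def pack(mat, tile_rows, tile_cols):
    """Copy mat into a grid of fixed-size tiles, zero-padding the edges.

    Result is indexed as packed[tile_i][tile_j][row_in_tile][col_in_tile].
    """
    rows, cols = len(mat), len(mat[0])
    n_tr = (rows + tile_rows - 1) // tile_rows
    n_tc = (cols + tile_cols - 1) // tile_cols
    return [[[[mat[i][j] if i < rows and j < cols else 0
               for j in range(tj * tile_cols, (tj + 1) * tile_cols)]
              for i in range(ti * tile_rows, (ti + 1) * tile_rows)]
             for tj in range(n_tc)]
            for ti in range(n_tr)]


def micro_kernel(c_tile, a_tile, b_tile):
    """c_tile += a_tile @ b_tile on fixed-size tiles (stand-in for a tuned kernel)."""
    for i in range(len(a_tile)):
        for p in range(len(b_tile)):
            a_ip = a_tile[i][p]
            for j in range(len(b_tile[0])):
                c_tile[i][j] += a_ip * b_tile[p][j]


def matmul_packed(a, b, tile=2):
    """Pack both operands, loop over tiles, call the micro-kernel, then unpack."""
    m, n = len(a), len(b[0])
    a_p = pack(a, tile, tile)
    b_p = pack(b, tile, tile)
    c_p = [[[[0] * tile for _ in range(tile)] for _ in range(len(b_p[0]))]
           for _ in range(len(a_p))]
    for ti in range(len(a_p)):            # tiles along M
        for tj in range(len(b_p[0])):     # tiles along N
            for tk in range(len(b_p)):    # tiles along K (reduction)
                micro_kernel(c_p[ti][tj], a_p[ti][tk], b_p[tk][tj])
    # Unpack: drop the zero padding and restore the row-major layout.
    return [[c_p[i // tile][j // tile][i % tile][j % tile] for j in range(n)]
            for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    b = [[1, 0], [0, 1], [2, 3]]
    print(matmul_packed(a, b))
```

The point of the layout change is that every micro-kernel call sees operands of one fixed shape, so a single tuned kernel can serve the whole computation; in a real compiler the tile sizes would be chosen from the cache hierarchy and the target's vector or matrix instructions.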

The popularity of fused kernels like Flash Attention and Paged Attention has created a need for hand-written kernel implementations that achieve peak performance on specialized hardware. DSLs like Triton have shown that choosing the right abstraction level for writing these kernels can significantly reduce their implementation complexity. We introduce Turbine Kernels (TK), a traced, dependently typed Python DSL that allows users to expose these abstraction levels from a compiler and productionize them as custom PyTorch operations. We target the IREE compiler, which embraces a modular approach for codegen and, instead of doing a one-shot conversion, does a gradual lowering through different levels of abstraction for different backends. We show how we have been using TK to expose different entry points to the IREE compiler and create high-performance custom ops for our PyTorch models.

Triton is an open-source kernel authoring language from OpenAI. It allows programmers to efficiently produce high-performance code for machine learning. While core Triton development focuses on GPU code generation for Nvidia and AMD GPUs, this is just the beginning of what Triton can do. Through the triton-shared project, the AI Compiler team at Microsoft is bringing Triton code generation to more varied platforms including NPUs and CPUs. This talk will touch on the Triton programming language, the triton-shared project, and the MLIR compiler framework.

Triton is an open-source Python-based programming language that lets ML programmers easily write highly efficient kernels for GPUs and CPUs. Pioneered at OpenAI with an initial focus on NVIDIA GPUs, Triton now has support for other backends as well. In this talk, we will describe our ongoing work on Triton support for the Qualcomm Hexagon target. Our approach leverages the Triton-to-Linalg conversion that the team at Microsoft has been developing and supporting. Kernels written in Python Triton lower to Triton-IR, which is now an MLIR dialect. The Triton-IR is then converted to MLIR Linalg ops. A set of custom and built-in MLIR passes then performs fusion, tiling and vectorization, and lowers the IR to high-performance, multi-threaded, vectorized LLVM-IR suited for the Qualcomm Hexagon target.

The Scalable Vector Extension (SVE) in the Arm® architecture introduced a set of vectors whose length is unknown at compile time and can vary between 128 and 2048 bits, depending on the hardware implementation. The Scalable Matrix Extension (SME) took that concept into two dimensions, making it possible to effectively target outer-product-based operations (e.g. matrix multiply) in a vector-length-agnostic programming model. Both of these extensions are instrumental in speeding up machine learning workloads on Arm®-based devices.

Including support for SVE and SME in compilers has required significant changes throughout the codebase, because the core data types that represent vectors and matrices had to change. In this talk we will discuss the introduction of SVE and SME into the TVM and MLIR machine learning compiler stacks, highlighting the similarities and differences of the implementations. Both stacks target SVE and SME through lowering to LLVM, but what happens before that differs in many aspects due to the different designs of the two ML frameworks.
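The vector-length-agnostic model mentioned above can be mimicked in plain Python to show why the same loop works for any hardware vector length. This is a conceptual simulation only, not SVE or SME code: `whilelt` is a hypothetical helper that imitates the idea of a loop predicate, and `vla_add` and `matmul_outer` are illustrative names that do not come from TVM, MLIR or LLVM.

```python
def whilelt(start, end, lanes):
    """Build a predicate: lane i is active while start + i < end."""
    return [start + i < end for i in range(lanes)]


def vla_add(x, y, vector_bits, elem_bits=32):
    """Element-wise add written without assuming a fixed vector length.

    The lane count is derived from vector_bits at run time, and a predicate
    masks off the lanes past the end of the data, so no scalar tail loop is needed.
    """
    lanes = vector_bits // elem_bits
    out = [0] * len(x)
    i = 0
    while i < len(x):
        pred = whilelt(i, len(x), lanes)
        for lane, active in enumerate(pred):
            if active:  # inactive lanes are neither read nor written
                out[i + lane] = x[i + lane] + y[i + lane]
        i += lanes
    return out


def matmul_outer(a, b):
    """Matrix multiply as a sum of rank-1 outer products.

    Each step of the reduction updates the whole accumulator tile with the
    outer product of one column of a and one row of b.
    """
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]
    for p in range(k):
        col = [a[i][p] for i in range(m)]
        row = b[p]
        for i in range(m):
            for j in range(n):
                acc[i][j] += col[i] * row[j]
    return acc


if __name__ == "__main__":
    x, y = list(range(10)), list(range(10, 20))
    for bits in (128, 256, 512, 2048):
        print(bits, vla_add(x, y, bits))
    print(matmul_outer([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because the lane count is a run-time value, a compiler cannot represent these vectors with fixed-size types, which is one way to see why scalable types touch so much of a codebase.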

Machine learning compilers benefit from the preservation of high-level semantics, enabling them to leverage domain-specific knowledge to perform advanced transformations, such as graph optimizations. This is not true, however, for machine learning code written in general-purpose programming languages such as C++, or the C++-based SYCL heterogeneous programming model. In these cases, existing compiler infrastructure usually translates the code to a lower-level set of instructions early. This results in missed optimization opportunities of the kind that other machine learning compilers exploit. Applications written with SYCL, such as portDNN or portBLAS, can benefit from the same kind of optimizations.
In this talk, we introduce SYCL-MLIR as an alternative to the current methods used to compile SYCL code. Unlike the existing methods, SYCL-MLIR utilizes MLIR to ensure the retention of rich semantics, enabling more comprehensive higher-level optimizations. We present the current status of SYCL-MLIR and how it leverages existing MLIR passes used in MLIR-based ML frameworks to speed up SYCL code. In comparison with existing LLVM-based SYCL compilers, SYCL-MLIR achieves significant speedups for the machine learning code we have tested.
In the talk, we also discuss how new high-level C++ features, such as mdspan, could better align with constructs commonly used in machine learning compilers, such as tensors. Using MLIR-based C++ compilers such as SYCL-MLIR to preserve the semantics of such constructs could bring the ability to leverage existing optimizations from machine learning compilers for applications written in C++.