Compilers for Machine Learning

5th C4ML workshop, at CGO 2024

Sunday, March 3, 2024

In person: Edinburgh, UK

Scope

Machine learning applications are becoming ubiquitous in large-scale production systems. With that growth and the scaling in data volume and model complexity, the focus on efficiently executing machine learning models has become even greater. The push for increased energy efficiency has led to the emergence of diverse heterogeneous system and accelerator architectures. In parallel, model complexity and diversity pushed for higher productivity systems, more powerful programming abstractions, type systems, language embeddings, frameworks and libraries. Compilers have historically been the bridge between programmer efficiency and high performance code, allowing the expression of code that remains understandable and productive to port and extend, while producing high-performance code for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. This goes both ways: given the broadening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks also emerged as natural clients of machine learning techniques, from domain-specific heuristics to autotuning.

This workshop aims to highlight cutting-edge work and research that applies compiler techniques and algorithms to optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack. The workshop topics span from high-level abstract representations to code generation for accelerators. The invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.

Program

The workshop features 8 presentations from leading ML compiler experts from industry and academia. 7 posters will be displayed at the end of the workshop (together with the main conference's welcome and poster reception), with short talks introducing the posters in the last session.

Venue: Edinburgh International Conference Centre (EICC).
Room: Carrick 1, 2.

09:15-09:20 - Opening

09:20-10:00 - Session 1 - Keynote / Debunking ML for Compilers

[slides] Fabrice Rastello, INRIA
Assessment of the Effectiveness of Analytical and ML-based Performance Models for Compiler Optimization

10:00-10:20 - Break

10:20-12:20 - Session 2 - ML Compiler Construction

[slides] Martin Lücke, University of Edinburgh and Google
Are compiler passes really enough?
[slides] Renato Golin, Intel Research
Towards a high-performance AI compiler with upstream MLIR
[slides] Kunwar Grover, AMD
Custom PyTorch Kernels with IREE and Turbine

12:20-13:20 - Lunch

13:20-15:20 - Session 3 - Target- and domain-specific optimization

[slides] Ian Bearman, Microsoft
Scaling Triton to Multiple Platforms with Triton-Shared
[slides] Javed Absar, Qualcomm
Experience with Triton Lowering and Optimization for Qualcomm Hexagon
Elen Kalda, Luke Hutton, ARM
Introducing vector length agnostic programming into ML compilation: Comparing SVE and SME enablement in TVM and MLIR

15:20-15:40 - Break

15:40-16:20 - Session 4 - ML compiler infrastructure for general-purpose computing

[slides] Lukas Sommer, Codeplay and Intel
Machine learning compiler optimizations for applications written in modern C++ and SYCL

16:20-17:20 - Session 5 - Poster Lightning Talks

Ari Rasch, Richard Schulze, Sergei Gorlatch, University of Muenster
Code Generation & Optimization for Deep-Learning Graphs via Multi-Dimensional Homomorphisms
Hongbin Zhang, Xulin Zhou, Jiuyang Liu, Zikang Liu, Linquan Wei, Yuliang Li, Taiqi Zheng, Meng Li, Hongyu Lin, Zhongyu Qin, Hanghang Cao, Jiongjia Lu, Weijia Li, Mingjie Xing, Yanjun Wu, Chinese Academy of Sciences, Huazhong University of Science and Technology, Beihang University, East China Normal University, Nanjing University
Buddy Compiler: An End-to-End AI Compiler from DSL to DSA
Jude Haris, Nicolas Bohm Agostini, Antonino Tumeo, David Kaeli, Jose Cano, University of Glasgow and Northeastern University
Data Transfer Optimizations for Host-CPU and Accelerators in AXI4MLIR
[slides] S. VenkataKeerthy, Siddharth Jain, Umesh Kalvakuntla, G Pranav Sai, Rajiv S Chitale, Eugene Brevdo, Albert Cohen, Mircea Trofin, Ramakrishna Upadrasta, IIT Hyderabad and Google
MLCompilerBridge: A Tool for interfacing ML and Compilers
Marco Siracusa, Miquel Moreto, Barcelona Supercomputing Center
Compiling Embedding Operations in MLIR to Decoupled Access-Execute Architectures
Ludger Paehler, Aiden Grossman, Jose Monsalve-Diaz, Tal Ben-Nun, Konstantinos Parasyris, Johannes Doerfert, TUM, UC Davis, ANL, LLNL
LLamaVM: Unlocking the Power of Intermediate Representation

18:00-20:00 - Poster Reception

Abstracts

Fabrice Rastello, INRIA - Assessment of the Effectiveness of Analytical and ML-based Performance Models for Compiler Optimization

Martin Lücke, University of Edinburgh and Google - Are compiler passes really enough?

The traditional model of structuring compilers into a series of discrete, sequential passes has stood as a foundational principle since the earliest days of computing. However, the focus on heterogeneous target architectures and evolving compute loads demands more dynamic structures capable of adapting code generation strategies based on the structure of the program to be compiled.

In response, solutions such as dynamic pass pipelines and various approaches to scheduling APIs have emerged to drive compilers more flexibly.

In this talk we take a brief look at the design considerations for passes in the past and examine abstractions we developed on top of that to meet the requirements of today. We identify shortcomings of our current approaches and offer a glimpse into the steps we are taking towards controlling compilers more flexibly with formal analysis and automation in mind.

Renato Golin, Intel Research - Towards a high-performance AI compiler with upstream MLIR

This work proposes a compilation flow using open-source compiler passes to build a framework to achieve ninja performance from a generic linear algebra high-level abstraction. We demonstrate this flow with a proof-of-concept MLIR project that uses input IR in Linalg-on-Tensor from TensorFlow and PyTorch, performs cache-level optimizations and lowering to micro-kernels for efficient vectorization, achieving over 90% of the performance of ninja-written equivalent programs. The contributions of this work include: (1) Packing primitives on the tensor dialect and passes for cache-aware distribution of tensors (single and multi-core) and type-aware instructions (VNNI, BFDOT, BFMMLA), including propagation of shapes across the entire function; (2) A linear algebra pipeline, including tile, fuse and bufferization strategies to get model-level IR into hardware friendly tile calls; (3) A mechanism for micro-kernel lowering to an open source library that supports various CPUs.
We have contributed substantially to the upstream Linalg and Tensor dialects, packing, tile and fuse passes, which is the main topic of this paper. Our goal is to continue contributing upstream and to gather the MLIR community around the same goal, so that we can make forward progress upstream, avoiding the not-invented-here syndrome of far too much downstream work.

Kunwar Grover, AMD - Custom PyTorch Kernels with IREE and Turbine

The popularity of fused kernels like Flash Attention and Paged Attention has created a need for hand-written kernel implementations that achieve peak performance on specialized hardware. DSLs like Triton have shown that choosing the right abstraction level for writing these kernels can significantly reduce their implementation complexity. We introduce Turbine Kernels (TK), a traced, dependently typed Python DSL that allows users to expose these abstraction levels from a compiler and productionize them as custom PyTorch operations. We target the IREE compiler, which embraces a modular approach for codegen and, instead of doing a one-shot conversion, does a gradual lowering through different levels of abstraction for different backends. We show how we have been using TK to expose different entry points to the IREE compiler and create high-performance custom ops for our PyTorch models.

Ian Bearman, Microsoft - Scaling Triton to Multiple Platforms with Triton-Shared

Triton is an open-source kernel authoring language from OpenAI. It allows programmers to efficiently produce high-performance code for machine learning. While core Triton development focuses on GPU code generation for Nvidia and AMD GPUs, this is just the beginning of what Triton can do. Through the triton-shared project, the AI Compiler team at Microsoft is bringing Triton code generation to more varied platforms including NPUs and CPUs. This talk will touch on the Triton programming language, the triton-shared project, and the MLIR compiler framework.

Javed Absar, Qualcomm - Experience with Triton Lowering and Optimization for Qualcomm Hexagon

Triton is an open-source Python-based programming language that lets ML programmers easily write highly efficient kernels for GPUs and CPUs. Pioneered at OpenAI with an initial focus on NVIDIA GPUs, Triton now has support for other backends (https://github.com/openai/triton/tree/main/third_party) as well. In this talk, we will describe our ongoing work on Triton support for the Qualcomm Hexagon target. One approach we took is to leverage the Triton-to-Linalg conversion that the team at Microsoft has been developing and supporting. Kernels written in Python Triton lower to Triton-IR, which is now an MLIR dialect. The Triton-IR is then converted to MLIR Linalg ops. A set of custom and built-in MLIR passes then performs fusion, tiling, and vectorization, and lowers the IR to high-performance, multi-threaded, vectorized LLVM-IR suited for the Qualcomm Hexagon target.

Elen Kalda, Luke Hutton, ARM - Introducing vector length agnostic programming into ML compilation: Comparing SVE and SME enablement in TVM and MLIR

Scalable Vector Extension (SVE) in the Arm® architecture introduced a set of vectors whose length is unknown at compile time and can vary between 128 and 2048 bits, depending on the hardware implementation. Scalable Matrix Extension (SME) took that concept into two dimensions, making it possible to effectively target outer-product-based operations (e.g. matrix multiply) in a vector-length-agnostic programming model. Both of these extensions are instrumental in speeding up machine learning workloads on Arm®-based devices.

Including support for SVE and SME in compilers has required significant changes throughout the codebase, because the core data types that represent vectors and matrices change. In this talk we will discuss the introduction of SVE and SME into the TVM and MLIR machine learning compiler stacks, highlighting the similarities and differences of the implementations. Both stacks target SVE and SME through lowering to LLVM, but what happens before that differs in many aspects due to the different designs of the two ML frameworks.

Lukas Sommer, Codeplay and Intel - Machine learning compiler optimizations for applications written in modern C++ and SYCL

Machine learning compilers benefit from the preservation of high-level semantics, enabling them to leverage domain-specific knowledge to perform advanced transformations, such as graph optimizations. This is not true, however, for machine learning code written using general-purpose programming languages such as C++, or the C++-based SYCL heterogeneous programming model. In these cases, existing compiler infrastructure usually translates the code to a lower-level set of instructions early. This results in missed optimization opportunities of the kind that other machine learning compilers exploit. Applications written with SYCL, such as portDNN or portBLAS, can benefit from the same kind of optimizations.
In this talk, we introduce SYCL-MLIR as an alternative to the current methods used to compile SYCL code. Unlike the existing methods, SYCL-MLIR utilizes MLIR to ensure the retention of rich semantics, enabling more comprehensive higher-level optimizations. We present the current status of SYCL-MLIR and how it leverages existing MLIR passes used in MLIR-based ML frameworks to speed up SYCL code. In comparison with existing LLVM-based SYCL compilers, SYCL-MLIR achieves significant speedups for the machine learning code we have tested.
In the talk, we also discuss how new high-level C++ features, such as mdspan, could better align with constructs commonly used in machine learning compilers, such as tensors. Using MLIR-based C++ compilers such as SYCL-MLIR to preserve the semantics of such constructs could bring the ability to leverage existing optimizations from machine learning compilers for applications written in C++.

Organizers

Albert Cohen, Google
Dibyendu Das, Intel
Diego Caballero, Google
Gokcen Kestor, PNNL
Jacques Pienaar, Google

Contact us

c4ml@googlegroups.com