Compilers for Machine Learning
Machine learning applications are becoming ubiquitous in large-scale production systems. With that growth and the scaling in data volume and model complexity, the focus on efficiently executing machine learning models has become even greater. The push for increased energy efficiency has led to the emergence of diverse heterogeneous system and accelerator architectures. In parallel, model complexity and diversity pushed for higher productivity systems, more powerful programming abstractions, type systems, language embeddings, frameworks and libraries. Compilers have historically been the bridge between programmer efficiency and high performance code, allowing the expression of code that remains understandable and productive to port and extend, while producing high-performance code for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. This goes both ways: given the broadening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks also emerged as natural clients of machine learning techniques, from domain-specific heuristics to autotuning.
This workshop aims to highlight cutting edge work and research that incorporate compiler techniques and algorithms in optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack. The workshop topics span from high-level abstract representations to code generation for accelerators. The list of invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.
The workshop features presentations from leading ML compiler experts from academia and industry. Presentations are grouped into 4 sessions, 2 in the morning and 2 in the late afternoon, to accomodate for attendees and speakers reaching out from all over the world.
All times below are displayed in the Eastern Time zone (EST, UTC-5).
09:00 - Gather space opens for 24h
09:20-09:30 - Opening
09:30-11:00 - Session 1 - Chaired by Albert Cohen, Google
Tianqi Chen, Carnegie Mellon University & OctoML: "Towards Automatic Scheduling for Tensorized Computation".
Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this talk, I will talk about TVM's latest developments in compiler abstraction for optimizing programs with these tensor computation primitives. I will also talk about our latest progress in automatic scheduling to automate the program optimizations for diverse acceleration backends.
Uday Bondhugula, Polymage Labs & Indian Institute of Science: "Polyhedral Building Blocks for High-Performance Code Generation in MLIR".
The polyhedral framework is a compiler representation that abstracts loops in code as dimensions of polyhedra. The representation is convenient for the transformation of multi-dimensional loop nests and arrays. Various effective transformation and parallelization techniques have been developed using the polyhedral framework over the past two decades. This talk deals with how these techniques can be brought into the MLIR infrastructure so that they become more modular, reusable, and compiler developer friendly. We call this planned specialized infrastructure Polyhedral Compiler Building Blocks, and they are in the form of MLIR operations that are code generated using techniques including polyhedral AST generation driven by the Integer Set Library as well as the MLIR Affine dialect infrastructure. We will use an example to show how powerful code transformation directives can be encoded in a compact manner to realize multi-level tiling, buffer packing, loop interchange, unroll-and-jam, and vectorization, and obtain performance competitive with that of highly tuned hand-optimized code albeit with the same compiler infrastructure that is used for other purposes.
Tobias Grosser & Arjun Pitchanathan, U. Edinburgh & IIIT Hyderabad: "A high-performance polyhedral math library as a foundation for AI compilers".
A plethora of program analysis and optimization techniques rely on Presburger arithmetic at their heart. However, such techniques are often considered too slow for production use. Linear programming lies at the heart of many algorithms used in Presburger arithmetic. While today’s best LP solvers are optimized for complex problems with thousands of dimensions, linear programming, as used in compilers, is typically applied to small and seemingly trivial problems, but to many instances in a single compilation run. As a result, compilers do not benefit from decades of research on optimizing large-scale linear programming.
We describe a new implementation of the Simplex algorithm, targeted at compilers. A novel theory of transprecision computation applied from individual elements to full data-structures provides the computational foundation. By carefully combining it with optimized representations for small and sparse matrices and specialized small-coefficient algorithms, we (1) reduce memory traffic, (2) exploit wide vectors, and (3) use low-precision arithmetic units effectively.
We evaluate our work by embedding our solver into a state-of-the-art integer set library and implement one essential operation, coalescing, on top of our transprecision solver. Our evaluation shows more than an order-of-magnitude speedup on the core simplex pivot operation and a mean speedup of 3.2x (vs. GMP) and 4.6x (vs. IMath) for the optimized coalescing operation. Our results demonstrate that our optimizations exploit the wide SIMD instructions of modern microarchitectures effectively. Finally, we briefly discuss how transprecision computation could be scaled to an entire Presburger library.
30' Break - Virtual coffee/tea on Gather
11:30-13:00 - Session 2 - Chaired by Ayal Zaks, Intel and Technion
Sanket Tavarageri, Intel Labs: "PolyDL: Polyhedral Compiler Optimizations for Deep Learning Workloads".
In the presentation, I will describe our ongoing polyhedral compilation and machine learning driven approaches to generate optimized code for Deep Learning (DL) primitives. Underpinning our work is a key observation that to achieve high performance especially on CPUs, we have to optimize for various hardware features including multi-level caches and vector units and consequently, the different optimization criteria require altogether distinct approaches. We are building a new program optimization framework - PolyDL where we have designed novel polyhedral compilation techniques and machine learning approaches for outer loop and inner loop optimizations respectively. The loop nest of the DL primitive is analyzed using a polyhedral model based data reuse algorithm, and the best loop transformations are derived. For effective SIMD vectorization, we have built a domain specific code generator and using Reinforcement Learning (RL) generate the best inner loop code.
Richard Osborne, Graphcore: "Understanding the Poplar Graph Compiler for IPUs".
Poplar is a software framework for Graphcore's Intelligence Processing Unit (IPU) range of devices that sits underneath machine learning frameworks like TensorFlow and PyTorch. Poplar's programming model is based around computational graphs containing large numbers of fine-grained tasks.
The Poplar Graph Compiler takes this graph structure and maps it across IPU's massively parallel architecture. We look at how knowing in advance what data needs to be written where lets Poplar produce highly optimised code with communication patterns that are compiled in order to take full advantage of the IPU's high performance on-chip data exchange network.
Rune Holm, ARM: "Memory access planning for NPUs".
Neural network algorithms tend to be arithmetic intense, so for good performance on a CPU, the key is good code generation to reach max utilisation of the SIMD units. However, NPUs add so many ALU units that the workload becomes memory bound again, and real-world NPUs often run workloads at less than half their stated Tops rating due to bandwidth limitations. Therefore, to get optimal performance for an NPU, the name of the game is optimisation of memory access patterns.
There are high level transformations for the reordering, slicing and interleaving of the execution of neural network layers, as well as the slicing and allocation of memory buffers, that can have drastic effects on cache footprint, memory bandwidth and therefore performance. Yet this is a complex search space, where greedy rewrites often will do the wrong thing. We present an execution planning framework that searches for globally optimal schedules within seconds, yielding considerable improvements in performance.
4h Break - Virtual lunch/dinner on Gather
17:00-18:30 - Session 3 - Chaired by Tatiana Shpeisman, Google
Ron Estrin, Cerebras: "Polyhedral compilation techniques for code generation on spatial architectures".
The Cerebras Wafer Scale Engine (WSE) consists of a massive number of homogeneous compute cores connected to each other on a grid interconnection network. The cores can only access their local scratchpad memory and the hardware implements a dataflow-triggered execution model, both of which pose interesting challenges as a target for compilation. We leverage the regular nature of neural network operations to generate efficient code for this unique architecture.
We describe techniques to compile neural network operations, expressed as directed acyclic graphs of tensor operations, to C-like code targeting the WSE. While the problem of distributing loops of a tensor operation across the two-dimensional grid of compute cores is an interesting topic, this talk focuses on the challenges and polyhedral techniques required to generate code for a given loop tiling.
For example, the compiler localizes the tensors to account for the distributed memory architecture. Further, to conform to the dataflow execution model, the processing on a core is modeled as an abstract task graph where each task denotes some atomic computation and the edges denote dataflow triggering. A concrete implementation is produced for the abstract task graph using hardware primitives such as task activation and callbacks. This talk walks through a simple example of the compilation process, demonstrating the various internal representations as well as the final code.
Benoit Steiner, Facebook: "Learning to optimize neural networks quickly".
As the usage of machine learning techniques is becoming ubiquitous, the efficient execution of neural networks is crucial to many applications. Several efforts leverage empirical feedback-driven performance optimization techniques to search for fast implementations of neural networks. These autotuning based approaches rely on traditional search techniques, such as genetic algorithm or simulated annealing, assisted by machine learning to explore the solution space. However, they suffer from long runtime which hampers their adoption.
In this talk, we present a new reinforcement learning based approach. We model the neural network optimization procedure, or scheduling, as a Markov Decision Process, and present a new technique to accurately predict the expected performance of a partial schedule using a LSTM over carefully engineered features that describe each DNN operator and their current scheduling choices. Using these predictions we build a policy which we leverage to rapidly identify an efficient schedule.
Our evaluation shows that our performance predictions are one order of magnitude more accurate than the state of the art. This enables us to find schedules that improve the execution performance of deep neural networks by 2.6× over Halide and 1.5× over TVM. Moreover, our technique is two to three orders of magnitude faster than that of these tools, and completes in seconds instead of hours.
Xiaoyong Liu, Alibaba: "An MLIR-Based end-to-end dynamic shape compiler".
Modern machine learning model complexity and diversity demand for an efficient and general dynamic shape-friendly compiler and runtime layer with great user experience. By leveraging MLIR infrastructure, starting with inference, we have been developing a MLIR-based end-to-end dynamic shape compiler for machine learning, which is fast, efficient and reliable for production environments across a wide range of hardware architectures.
This talk will introduce the foundation of dynamic shape features. You’ll learn about this specific AI compiler’s core architect, deployment flow, optimization pipeline, memory usage and runtime layer. In a real-world case study session, experiments show up to 3.3x speedup on GPU than TensorFlow/Pytorch out of the box. Finally, I’ll talk about further works to generalize the AI compiler’s capability.
30' Break - Virtual coffee/tea on Gather
19:00-20:30 - Session 4 - Chaired by Jacques Pienaar, Google
Chao Liu & Wen-Heng (Jack) Chung, AMD: "Realize implicit GEMM-based convolutions on AMD GPU using MLIR".
We examine steps needed to realize an efficient 2D convolution GPU kernel generator on AMD GPU using MLIR. We employ several primitives for composing complex tensor computations easily and efficiently. By introducing the concept of generic tensors, and a mathematically rigorous definition of tensor coordinate space transformation, we can reduce an arbitrary convolution into a dense matrix-matrix multiplication, which is well studied and are highly efficient on modern GPU architectures. We close functional gaps between existing MLIR dialects and AMD GPU-specific instructions and construct an implementation with satisfactory performance results.
Xinmin Tian & Yuanke Luo, Intel: "DPC++ Compiler and Performance Tuning for AI workloads".
In the high-performance computing realm, emerging DPC++ / SYCL programming model is getting a great attraction to a wide range AI/HPC applications and Hardware accelerators including GPUs. DPC++ programming model provides a portable solution to unpack the possibilities of our CPU-to-GPU offloading model, this presentation includes:
DPC++ Compiler Overview;
DPC++ Performance tuning for AI/Rodinia Workload on CPU and GPU;
AMX Model and Compiler Support in LLVM.
Jian Hui Li, Intel: "oneDNN Graph API: unify deep learning framework integration and maximize compute efficiency for multiple AI hardware".
oneDNN Graph API extends oneDNN with a high-level graph API for multiple AI hardware classes (CPU, GPU, accelerators). With a flexible graph interface, it maximizes the optimization opportunity for generating efficient code across a variety of Intel and non-Intel HW and can be closely integrated with ecosystem framework and inference engines. Meanwhile, it aims to unify and reduce the cost of framework integration so that oneDNN Graph based integration can be reused for various backends developed for a diverse range of hardware devices.
The talk will cover key interface design considerations allowing various implementations of oneDNN Graph API. oneDNN Graph API preview has been open-sourced in Dec 2020 (https://github.com/oneapi-src/oneDNN/tree/dev-graph). The talk will show how oneDNN Graph API provides flexible fusion interface for oneDNN's hand-tuned fusion capability and can be used to integrate MLIR based PlaidML backend with deep learning framework.
20:30 - Closing - Gather space remains open for 12h
Diego Caballero, Intel
Albert Cohen, Google
Jacques Pienaar, Google
Tatiana Shpeisman, Google
Ayal Zaks, Intel and Technion