Compilers for Machine Learning
Machine learning applications in large-scale production systems have grown dramatically in recent years. With that growth, and the scaling of data volume and model complexity, the focus on executing these models efficiently has become even greater. This push for performance has led to the emergence of diverse heterogeneous architectures that accelerate these workloads. In parallel, model complexity and diversity have pushed for higher-productivity systems, more powerful programming abstractions, type systems, language embeddings, and more. Compilers have historically been the bridge between programmer productivity and high performance, allowing the expression of code that remains understandable and productive to port and extend, while producing high-performance code for diverse architectures. As such, compiler techniques have been increasingly incorporated into machine learning frameworks. The relationship goes both ways: given the widening gap between high-level constructs and hardware accelerators, compilers in machine learning frameworks have also emerged as natural clients of machine learning techniques, from advanced heuristics to generic autotuning.
This workshop aims to highlight work and research that incorporate compiler techniques and algorithms in optimizing machine learning workloads. Compiler techniques affect a large part of the machine learning stack, and the workshop topics span from high-level abstract representations to code generation for accelerators. The invited speakers are similarly experts across the different levels of the stack. The workshop does not have formal proceedings, and presentations will include ample time for interaction.
The workshop aims to bring together practitioners working on compilers for machine learning as well as a wider community interested in this rapidly growing area, to raise awareness of the existing efforts and facilitate free exchange of ideas.
07:30 - 08:55 Breakfast (provided)
08:55 - 09:00 Workshop Opening
09:00 - 09:30 "Getting to Machine Learning from a General Purpose Compiler", Keno Fischer & Jameson Nash, Julia Computing
09:30 - 10:00 "A Programming Language and Compiler View on AI Systems", Tiark Rompf, Purdue
10:00 - 10:30 Break (provided)
10:30 - 11:00 "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning", Tianqi Chen, University of Washington
11:00 - 11:30 "Glow: Graph Lowering Compiler Techniques for Neural Networks", Jordan Fix & Roman Dzhabarov, Facebook
11:30 - 12:00 "Compiling ML with XLA", Bjarke Roune, Google
12:00 - 13:00 Lunch (provided)
13:00 - 13:30 "nGraph: Unlocking Next-Generation Deep Learning Performance with Compilers", Jayaram Bobba, Intel
13:30 - 14:00 "TensorRT - a Platform for Deep Learning Inference", Arch D. Robison, Nvidia
14:00 - 14:30 "The Sparse Tensor Algebra Compiler", Saman Amarasinghe, MIT
14:30 - 15:00 Break (provided)
15:00 - 15:30 "Extending PlaidML to Encompass the Modern Accelerator Landscape", Tim Zerrell, Intel
15:30 - 16:00 "Polyhedral Compilation of ML Computation Graphs", Vinod Grover, Nvidia
16:00 - 16:30 "Compiling Deep Neural Networks for ACAP Devices", Sam Bayliss & Stephen Neuendorffer, Xilinx
16:30 - 17:00 "MLIR Primer: A Compiler Infrastructure for the End of Moore’s Law", Chris Lattner & Jacques Pienaar, Google
17:00 - 18:00 Open Discussions - All speakers invited to interact with the audience (followed by welcome reception and poster session)
"Getting to Machine Learning from a General Purpose Compiler", Keno Fischer, Jameson Nash, Julia Computing
In many ways, the current paradigm of machine learning frameworks represents a failure of current compiler and programming language technology to deliver adequate performance from user-friendly code. Machine learning models are written as code in a high-level dynamic programming language like Python, but rather than compiling this code directly, we instead use this code to "metaprogram" a much more restricted "graph language" that is more amenable to compiler analysis and optimization. The reason for this is simple: analyzing high-level dynamic programming languages is hard, while analyzing simple dataflow graphs is much easier.
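The "metaprogramming a graph language" pattern described above can be sketched in a few lines of Python. The names here (Node, const, add, mul, evaluate) are hypothetical, illustrating the idea rather than any particular framework's API:

```python
class Node:
    """A node in a restricted dataflow-graph language."""
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

def const(v):   return Node("const", value=v)
def add(a, b):  return Node("add", (a, b))
def mul(a, b):  return Node("mul", (a, b))

def evaluate(node):
    # The graph, not the Python code, is what the framework's compiler
    # sees: it can be analyzed, optimized, and executed after the fact.
    if node.op == "const":
        return node.value
    args = [evaluate(i) for i in node.inputs]
    return {"add": lambda x, y: x + y,
            "mul": lambda x, y: x * y}[node.op](*args)

# Ordinary-looking Python code that actually *builds* a graph:
x, y = const(3.0), const(4.0)
z = add(mul(x, x), mul(y, y))   # z is a Node, not a number
print(evaluate(z))              # 25.0
```

Note that `z` carries the whole computation as data; this is exactly what makes the graph easy to analyze, and also what breaks interoperability with code written outside the graph language.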
This convenience, however, comes at a significant cost. We lose interoperability with code not written in the restricted graph language, thus restricting innovation and cross-disciplinary collaboration. Additionally, we need to write two completely separate compiler stacks, and we cannot optimize across language boundaries. For simple models these restrictions may be fine, and they have gotten us this far, but as models get more complex and more dynamic and start to integrate with more traditional programs (e.g., physics simulations, ODE solvers, or environmental simulators in reinforcement learning), these drawbacks quickly become apparent.
We propose to go the other way, by enhancing the performance and hardware targeting capabilities of a general purpose compiler for a high-level dynamic programming language (Julia). This is a more difficult task for the compiler, as it now needs to deduce information that was previously manifest, and it needs to be extensible enough to take advantage of domain information without precluding interoperability. On the other hand, it avoids the aforementioned drawbacks. Additionally, clever improvements to the compiler improve performance for all users of the language, not just those working in machine learning. In this talk, we will discuss how we extract sufficient information from a dynamic language in order to apply compiler transforms traditionally reserved for static languages, the many points of compiler extensibility provided, backend retargetability to GPUs and TPUs, and how putting all of this together allows us to obtain a competitive, modern machine learning stack in just a few thousand lines of code.
"A Programming Language and Compiler View on AI Systems", Tiark Rompf, Purdue
Current and emerging deep learning architectures call for an expressive high-level programming style with end-to-end differentiation and, at the same time, for a high-performance implementation. But the current generation of deep learning frameworks tends to sacrifice either expressiveness and ease of use for performance (e.g., TensorFlow) or vice versa (e.g., PyTorch). In this talk we demonstrate our ideas for a “best of both worlds” approach, based on multi-stage programming and delimited continuations, two orthogonal ideas firmly rooted in programming languages research.
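Multi-stage programming can be illustrated with a classic toy example: generating a specialized power function for a fixed exponent, so the residual (next-stage) program has the recursion removed. This is a generic sketch in Python, not the staging machinery the talk describes:

```python
def power_code(n, var="x"):
    """Stage 1: build the source of a power function specialized to n."""
    expr = "1.0"
    for _ in range(n):
        expr = f"({expr} * {var})"   # unroll the loop at staging time
    return f"lambda {var}: {expr}"

# Stage 2: the residual program, with no loop or exponent left in it.
pow5 = eval(power_code(5))
print(power_code(5))   # lambda x: (((((1.0 * x) * x) * x) * x) * x)
print(pow5(2.0))       # 32.0
```

The same idea, applied with types and proper staged-code values instead of strings, is what lets a framework recover a restricted, optimizable program from expressive high-level code.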
"TVM: An Automated End-to-End Optimizing Compiler for Deep Learning", Tianqi Chen, University of Washington
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. In this talk, we introduce TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
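Operator fusion, one of the optimizations named above, can be sketched in miniature: computing relu(a + b) either as two passes with a materialized intermediate, or as one fused pass. The function names are illustrative, not TVM's API:

```python
def add_then_relu_unfused(a, b):
    tmp = [x + y for x, y in zip(a, b)]   # materializes an intermediate
    return [max(0.0, t) for t in tmp]     # second pass over memory

def add_then_relu_fused(a, b):
    # One loop, no intermediate buffer: the kind of fused kernel a
    # compiler could generate for this two-operator subgraph.
    return [max(0.0, x + y) for x, y in zip(a, b)]

a, b = [1.0, -2.0, 3.0], [0.5, 1.0, -5.0]
print(add_then_relu_fused(a, b))          # [1.5, 0.0, 0.0]
```

On real hardware the fused version halves memory traffic, which is why fusion matters far more for bandwidth-bound deep learning operators than for compute-bound ones.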
"Glow: Graph Lowering Compiler Techniques for Neural Networks", Jordan Fix, Roman Dzhabarov, Facebook
We present the design of Glow, an open source machine learning compiler for heterogeneous hardware. Glow is a pragmatic approach to compilation that enables the generation of highly optimized code for multiple targets. It lowers the traditional neural network dataflow graph into a two-phase strongly-typed intermediate representation. The high-level intermediate representation allows the optimizer to perform domain-specific optimizations. The lower-level instruction-based address-only intermediate representation allows the compiler to perform memory-related optimizations, such as instruction scheduling, static memory allocation, and copy elimination. At the lowest level, the optimizer performs machine-specific code generation to take advantage of specialized hardware features.
Glow features a lowering phase which enables the compiler to support a large number of input operators as well as many hardware targets by eliminating the need to implement all operators on all targets. The lowering phase is designed to reduce the input space and allow new hardware backends to focus on a small number of linear algebra primitives. To date, 10 companies have committed to supporting Glow in future silicon products. Each of their accelerators will likely differ in capabilities, and will use Glow for automating compilation tasks such as instruction selection, memory allocation and graph scheduling.
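The lowering idea above can be sketched as a rewrite from a high-level operator to a couple of linear algebra primitives, so a backend only has to implement the primitives. The node and operator names here are hypothetical, not Glow's actual IR:

```python
def lower(node):
    """Rewrite high-level ops into {matmul, broadcast_add} primitives."""
    op, args = node
    if op == "fully_connected":          # x @ W + b
        x, w, b = args
        return ("broadcast_add", [("matmul", [x, w]), b])
    return node                          # already a primitive

hi = ("fully_connected", ["x", "W", "b"])
lo = lower(hi)
print(lo)   # ('broadcast_add', [('matmul', ['x', 'W']), 'b'])
```

With N operators and M backends, this turns an N×M implementation burden into roughly N rewrite rules plus M implementations of a small primitive set.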
"Compiling ML with XLA", Bjarke Roune (presenter), and everyone on the XLA team, Google
XLA is a high-performance compiler backend for ML systems like TensorFlow and PyTorch, targeting CPU, GPU, and TPU, with further backends in development. This talk will go through how parts of XLA work and some of the things that we have learned while building it, including our approach to programming TPUs.
"nGraph: Unlocking Next-Generation Deep Learning Performance with Compilers", Jayaram Bobba, Intel
Deep Learning continues to evolve and drive many novel use-cases in real-world applications at scale. Delivering performance across these different use-cases is challenging given the wide range of existing/new hardware and the prevalence of multiple ML/DL frameworks for developing and deploying the models. Existing approaches based on kernel libraries and their integration with frameworks are clearly not sufficient. Graph-based compilers are showing great potential in accelerating deep learning performance.
nGraph is an open-source compilation and runtime library suite that has been developed to efficiently deliver performance across multiple ML frameworks on a range of hardware. In this talk, we will present a high-level overview of nGraph, its intermediate representation (IR), optimization pipeline and runtime interfaces. We will also discuss optimizations (e.g., memory allocation, data layout, computation caching) that the graph-based IR helps us unlock and share across multiple frameworks and hardware backends. Frameworks using nGraph to execute workloads have shown up to 45X performance boost when compared to their default implementations.
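One of the optimizations named above, computation caching, can be sketched as memoizing graph evaluation so a shared subgraph is computed once and reused. The graph encoding and cache key scheme here are illustrative only, not nGraph's:

```python
def run(node, cache=None):
    """Evaluate a tuple-encoded graph, caching each subgraph's result."""
    cache = {} if cache is None else cache
    op, args = node[0], node[1:]
    key = repr(node)
    if key in cache:                     # shared subgraph: reuse result
        return cache[key]
    if op == "const":
        out = args[0]
    elif op == "add":
        out = run(args[0], cache) + run(args[1], cache)
    elif op == "mul":
        out = run(args[0], cache) * run(args[1], cache)
    cache[key] = out
    return out

expensive = ("mul", ("const", 3.0), ("const", 4.0))
graph = ("add", expensive, expensive)    # shared subgraph computed once
print(run(graph))                        # 24.0
```

A graph-level IR makes this kind of reuse easy to detect, which is hard to do when each framework calls opaque kernel-library routines directly.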
"TensorRT - a Platform for Deep Learning Inference", Arch D. Robison, Nvidia
NVIDIA TensorRT™ is a platform for high-performance deep learning inference. The “builder” portion of TensorRT lets users specify a network with applicative semantics, and compile it to an “engine” that efficiently executes the network on GPU work streams. Compiler aspects include fusing layers to reduce bandwidth, selecting kernels and formats for data interchange, and scheduling memory usage. A new “refitting” feature enables updating an engine’s weights in place. The new weights are specified at the network level and TensorRT deals with transforming those weights to what the optimized engine needs.
"The Sparse Tensor Algebra Compiler", Saman Amarasinghe, MIT
Tensor algebra is a powerful tool with novel applications in machine learning and data analytics, as well as in traditional domains such as engineering and science. Increasingly often the tensors are sparse, meaning most components are zero. To get the best performance, programmers are currently left to write custom kernels for every operation, with different mixes of sparse and dense tensors in different formats. There are countless combinations, which makes it impossible to manually implement and optimize them all. The Tensor Algebra Compiler (TACO) is the first system to automatically generate kernels for any tensor algebra operation on tensors in any of the commonly used formats. Its performance is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations. For more information, see http://tensor-compiler.org.
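To make the problem concrete, here is the kind of kernel in question: a sparse matrix-vector multiply over the common CSR (compressed sparse row) format, written by hand for illustration. TACO generates loops like these automatically for arbitrary tensor expressions and formats:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x for A stored in CSR form; touches only nonzeros."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y

# A = [[2, 0, 1],
#      [0, 0, 0],
#      [0, 3, 0]]
row_ptr, col_idx, vals = [0, 2, 2, 3], [0, 2, 1], [2.0, 1.0, 3.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 0.0, 3.0]
```

Each combination of expression, format mix, and loop order needs a different such kernel, which is why hand-writing them all does not scale.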
"Extending PlaidML to Encompass the Modern Accelerator Landscape", Tim Zerrell, Intel
PlaidML is a tensor compiler that automatically produces optimized kernels from hardware descriptions. A new IR inside PlaidML called Stripe extends the hardware model to a broader range of target architectures, including in particular hardware accelerator designs. Stripe enables developing parameterized hardware models with corresponding optimization passes that can be used for broad classes of architectures. This hardware model is highly abstract, enabling hardware / software co-design through efficient exploration of the overall design space.
"Polyhedral Compilation of ML Computation Graphs", Vinod Grover, Nvidia
In this talk we describe a core programming model which is expressive enough to describe a wide variety of ML computation graphs for inference and training. The model uses established PL concepts to build graph patterns using tensors.
We describe the compiler that implements this programming model. We use a polyhedral compiler called Diesel to generate optimized code for CPUs and GPUs. Fusion and data-layout optimizations are among the transformations performed within this compiler. In this talk we will give an overview of our programming system and some initial results.
"Compiling Deep Neural Networks for ACAP Devices", Sam Bayliss, Stephen Neuendorffer, Xilinx
Deep Neural Networks are becoming a key computational workload for many systems, both embedded and in the data center. Executing these workloads efficiently requires an elegant combination of specialized hardware and domain-specific optimization. At the same time, the state of the art is changing rapidly, implying that effective solutions must be configurable and adaptable as new network structures and applications arise. In this talk, we will describe the heterogeneous ACAP architecture and show how it can be used to solve a variety of deep learning problems. We argue that effectively leveraging such an architecture, while maintaining portability and fast time-to-solution, requires a variety of new compilation technologies and strategies.
"MLIR Primer: A Compiler Infrastructure for the End of Moore’s Law", Chris Lattner, Jacques Pienaar, and everyone on the MLIR team, Google
The growing diversity of domain-specific accelerators spans all scales, from mobile devices to data centers. It constitutes a global challenge across the high-performance computing stack and is particularly visible in the field of machine learning (ML). Program representations and compilers need to support a variety of devices at multiple levels of abstraction, from scalar instructions to coarse-grain parallelism and large-scale distribution of computation graphs. This puts great pressure on the construction of both generic and target-specific optimizations, with domain-specific language support, interfaces with legacy and future infrastructure, and special attention to future-proofing, modularity, and code reuse. It motivates the construction of a new infrastructure that unifies graph representations, ML operators, and optimizations at different levels and also across levels, targets, ML frameworks, training and inference, and quantization, while interacting tightly with runtime systems. Compilers are expected to readily support new applications, to port easily to new hardware, and to bridge many levels of abstraction, from dynamic, managed languages to vector accelerators and software-managed memories, while exposing high-level knobs for autotuning, enabling just-in-time operation, providing diagnostics, propagating functional and performance debugging information across the entire stack, and delivering performance close enough to hand-written assembly in most cases. We will share our vision, progress, and plans towards the design and public release of such a compiler infrastructure.