Publication
Bridging the Gap: Towards Optimization Across Linear and Relational Algebra
Andreas Kunft; Alexander Alexandrov; Asterios Katsifodimos; Volker Markl
In: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond. ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR), located at SIGMOD/PODS '16, July 1, San Francisco, CA, USA, ISBN 978-1-4503-4311-4, ACM, New York, NY, USA, 2016.
Abstract
Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program.
The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
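To illustrate the kind of program the abstract describes, the following is a minimal, hypothetical Scala sketch of an end-to-end pipeline over the two abstract data types named above (DataBag for the relational part, Matrix for the linear-algebra part). The method names used here (readCSV, toMatrix, ones, normalize, t, %*%) are illustrative assumptions, not the published Lara API; the point is that both the join and the iterative matrix computation sit in one program, so a common IR can reason about them jointly (e.g., reuse the join's partitioning for the matrix operations).

// Hypothetical sketch only: API names below are assumed for illustration.
case class Click(userId: Long, dwellTime: Double)
case class User(id: Long, age: Int)

val clicks = DataBag.readCSV[Click]("clicks.csv")   // assumed reader
val users  = DataBag.readCSV[User]("users.csv")

// Relational side: join and projection as a for-comprehension over DataBag.
val features = for {
  c <- clicks
  u <- users
  if c.userId == u.id
} yield (u.age.toDouble, c.dwellTime)

// Linear-algebra side: convert to a Matrix and iterate.
val X = features.toMatrix                            // assumed conversion
var v = Matrix.ones(X.numCols)                       // assumed constructor
for (_ <- 1 to 10) {                                 // control flow visible to the IR
  v = (X.t %*% (X %*% v)).normalize                  // power iteration on X^T X (illustrative)
}

In a single-language program like this, the compiler sees the boundary between the DataBag and Matrix parts explicitly, which is what enables the cross-algebra optimizations (shared layouts, partitioning, data structures) the abstract refers to.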