Publication

AStream: Ad-hoc Shared Stream Processing

Jeyhun Karimov; Tilmann Rabl; Volker Markl

In: SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. ACM SIGMOD International Conference on Management of Data (SIGMOD-2019), June 30 - July 5, Amsterdam, Netherlands, Pages 607-622, ISBN 978-1-4503-5643-5, ACM, New York, NY, 6/2019.

Abstract

In the last decade, many distributed stream processing engines (SPEs) were developed to perform continuous queries on massive online data. The central design principle of these engines is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. In many real applications, streams are not only processed with long-running queries, but also with thousands of short-running ad-hoc queries. To support this efficiently, it is essential to share resources and computation for ad-hoc stream queries in a multi-user environment. The goal of this paper is to bridge the gap between stream processing and ad-hoc queries in SPEs by sharing computation and resources. We define three main requirements for ad-hoc shared stream processing: (1) Integration: Ad-hoc query processing should be a composable layer which can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed in a consistent manner and ensure exactly-once semantics and correctness; (3) Performance: In contrast to state-of-the-art SPEs, an ad-hoc SPE should maximize not only data throughput but also query throughput via incremental computation and resource sharing. Based on these requirements, we have developed AStream, an ad-hoc, shared computation stream processing framework. To the best of our knowledge, AStream is the first system that supports distributed ad-hoc stream processing. AStream is built on top of Apache Flink. Our experiments show that AStream achieves results comparable to Flink for single-query deployments and outperforms it by orders of magnitude with multiple queries.

Projects

STREAMLINE - Streamlined Analysis of Data at Rest and Data in Motion