Zero-Shot Cost Models for Distributed Stream Processing
Roman Heinrich, Manisha Luthra, Harald Kornmayer, Carsten Binnig
Zero-shot cost models aim to provide accurate cost predictions on dynamic and unseen workloads of Distributed Stream Processing Systems (DSPS). A major premise of this work is that the proposed learned model can generalize to the dynamics of streaming workloads out of the box. This means that a model, once trained, can accurately predict performance metrics such as latency and throughput even if the characteristics of the data and workload or the deployment of operators to hardware change at runtime. The model can therefore be used for tasks such as optimizing the placement of operators to minimize the end-to-end latency of a streaming query or to maximize its throughput, even under varying conditions. Our evaluation on a well-known DSPS, Apache Storm, shows that the model predicts accurately for unseen workloads and queries while generalizing across real-world benchmarks.
PANDA: Performance Prediction for Parallel and Dynamic Stream Processing
Pratyush Agnihotri, Boris Koldehofe, Carsten Binnig, Manisha Luthra
PANDA focuses on selecting appropriate resources for parallel stream processing in the presence of dynamic and unseen workloads. The main idea is to predict optimal parallelism degrees for resources using zero-shot cost models for massively parallel operations of DSPS. A core challenge we aim to solve in this work is predicting when and how to scale data-parallel and task-parallel operations in DSPS to meet elasticity demands. Our preliminary evaluation on a widely used DSPS, Apache Flink, shows that parallelism mechanisms strongly influence the performance characteristics of DSPS across different workloads, an influence that PANDA's learned model needs to learn and adapt to.