Our paper “DeepDB: Learn from Data, not from Queries!” was accepted at VLDB 2020 in Tokyo

The typical approach for learned DBMS components is to capture their behavior by running a representative set of queries and using the observations to train a machine learning model. We propose to learn a purely data-driven model instead.

2020/02/20

The typical approach for learned DBMS components is to capture their behavior by running a representative set of queries and using the observations to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, the training data has to be recollected whenever the workload or the data changes. To overcome these limitations, we take a different route: we propose to learn a purely data-driven model that captures the basic characteristics of a given database.

As a result, our data-driven approach supports not only ad-hoc queries but also updates of the data, without the need to retrain when the workload or the data changes. One might expect this to come at the price of lower accuracy, since workload-driven models can make use of more information. However, as our empirical evaluation demonstrates, our data-driven approach not only provides better accuracy than state-of-the-art learned components for cardinality estimation and approximate query processing (AQP) but also generalizes better to unseen queries.
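
To make the data-driven idea concrete, here is a deliberately simplified Python sketch. It is not the model from the paper: it keeps only per-column value counts and assumes that columns are independent, which a real system would not. The class name `IndependentColumnModel`, its methods, and the sample rows are all hypothetical and chosen for illustration. What the sketch does show is that such a model is built from the data alone, can answer a query it has never seen, and absorbs an insert by updating counters instead of retraining.

```python
from collections import Counter


class IndependentColumnModel:
    """Toy data-driven model (hypothetical, not DeepDB's actual model).

    Stores per-column value counts and assumes columns are independent.
    """

    def __init__(self, columns):
        self.columns = list(columns)
        self.counts = {c: Counter() for c in self.columns}
        self.n_rows = 0

    def insert(self, row):
        # Incremental update: bump the counters, no retraining and no queries.
        for c in self.columns:
            self.counts[c][row[c]] += 1
        self.n_rows += 1

    def selectivity(self, column, predicate):
        # Fraction of rows whose value in `column` satisfies `predicate`.
        if self.n_rows == 0:
            return 0.0
        matching = sum(n for v, n in self.counts[column].items() if predicate(v))
        return matching / self.n_rows

    def estimate_cardinality(self, predicates):
        # Conjunctive query: multiply per-column selectivities
        # (this is the independence assumption, a simplification).
        sel = 1.0
        for column, predicate in predicates.items():
            sel *= self.selectivity(column, predicate)
        return sel * self.n_rows


# Made-up sample data, for illustration only.
rows = [
    {"region": "EU", "age": 25},
    {"region": "EU", "age": 40},
    {"region": "US", "age": 31},
    {"region": "US", "age": 52},
]

model = IndependentColumnModel(["region", "age"])
for r in rows:
    model.insert(r)

# An ad-hoc query the model was never "trained" on.
query = {"region": lambda v: v == "EU", "age": lambda v: v >= 30}
estimate = model.estimate_cardinality(query)

# A data update is absorbed directly, without rebuilding the model.
model.insert({"region": "EU", "age": 60})
updated_estimate = model.estimate_cardinality(query)
```

On the four sample rows the sketch estimates 1.5 matching rows while the true count is 1. The gap comes from the independence assumption, and it illustrates why a practical data-driven model also needs to capture correlations between columns.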