Modern machine learning techniques can significantly improve the performance of classical database components such as the query optimizer. To train these learned DBMS components, a representative set of queries has to be executed, and the results used to train a machine learning model. This workload-driven approach, however, has two major downsides. First, collecting the training data can be very expensive, since all queries need to be executed on potentially large databases. Second, the training data has to be recollected whenever the workload or the database changes.
In this project, we instead propose a data-driven approach for learned DBMS components which directly supports changes to the workload and data without the need for retraining. To achieve this, we learn deep probabilistic models over different parts of the database schema and show how to combine them efficiently. One might expect that this comes at the price of lower accuracy, since workload-driven approaches can make use of more information. However, we demonstrate that our data-driven approach not only provides better accuracy than state-of-the-art learned components for cardinality estimation but also generalizes better to unseen queries.
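To make the data-driven idea concrete, the sketch below illustrates its simplest possible form: a model is trained from the data alone (here, naive per-column frequency histograms combined under a column-independence assumption) and can then estimate the cardinality of any conjunctive equality predicate without ever executing training queries. This is a deliberately simplified illustration, not the deep probabilistic models described above, which additionally capture correlations across columns; all names and the toy table are hypothetical.

```python
from collections import Counter

def train_column_models(rows, columns):
    """Learn a per-column value-frequency model from the data itself.
    No query workload is needed, and the model can be rebuilt cheaply
    when the data changes."""
    n = len(rows)
    models = {}
    for col in columns:
        counts = Counter(row[col] for row in rows)
        models[col] = {v: c / n for v, c in counts.items()}
    return models, n

def estimate_cardinality(models, n, predicate):
    """Estimate the number of rows matching a conjunctive equality
    predicate, assuming column independence (a simplification; richer
    probabilistic models also capture inter-column correlations)."""
    selectivity = 1.0
    for col, val in predicate.items():
        selectivity *= models[col].get(val, 0.0)
    return selectivity * n

# Hypothetical toy table.
rows = [
    {"city": "Berlin", "status": "gold"},
    {"city": "Berlin", "status": "silver"},
    {"city": "Munich", "status": "gold"},
    {"city": "Berlin", "status": "gold"},
]
models, n = train_column_models(rows, ["city", "status"])
est = estimate_cardinality(models, n, {"city": "Berlin", "status": "gold"})
# Estimate: (3/4) * (3/4) * 4 = 2.25; true cardinality is 2.
```

Because the model is derived purely from the data, a new query shape requires no retraining, whereas a workload-driven model would need new training queries to be executed.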