The predominant paradigm today for learned DBMS components is workload-driven learning, i.e., running a representative set of queries on the database and using the observations to train a machine learning model. This approach, however, has two major downsides. First, collecting the training data can be very expensive, since many queries have to be executed on potentially large databases. Second, the training data has to be recollected whenever the workload or the database changes.
Hence, in this talk we present our vision for tackling the high cost and inflexibility of workload-driven learning. First, we introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications, such as cardinality estimation and approximate query processing (AQP), many other tasks, such as physical cost estimation, cannot be supported. We thus propose a second technique called zero-shot learning, which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen databases out of the box, i.e., without requiring workloads on those databases as training data and without retraining. To this end, a model is trained on a variety of workloads observed on different databases, which allows it to generalize. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss research opportunities that are enabled by zero-shot learning.
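To make the data-driven idea concrete, here is a minimal sketch in Python. It is not the method presented in the talk: it covers a single table and assumes that columns are independent, whereas the approach described above learns the distribution over a complex relational schema. The function names and the toy table are hypothetical and chosen only for illustration. The point is that the model is fitted from the data alone, without executing any query workload, and is then used for cardinality estimation.

```python
from collections import Counter


def fit_column_distributions(rows, columns):
    """Learn per-column value frequencies directly from the data.

    No queries are executed; the "training data" is the table itself.
    """
    n_rows = len(rows)
    if n_rows == 0:
        raise ValueError("cannot fit a distribution on an empty table")
    dist = {}
    for i, col in enumerate(columns):
        counts = Counter(row[i] for row in rows)
        dist[col] = {value: count / n_rows for value, count in counts.items()}
    return {"n_rows": n_rows, "dist": dist}


def estimate_cardinality(model, predicates):
    """Estimate the number of rows matching a conjunction of equality predicates.

    Simplifying assumption: columns are independent, so the selectivities
    of the individual predicates are multiplied.
    """
    selectivity = 1.0
    for col, value in predicates.items():
        selectivity *= model["dist"][col].get(value, 0.0)
    return model["n_rows"] * selectivity


# Hypothetical toy table with columns (country, tier).
rows = [("DE", "gold"), ("DE", "silver"), ("US", "gold"), ("US", "gold")]
model = fit_column_distributions(rows, ["country", "tier"])

# P(country=DE) = 0.5 and P(tier=gold) = 0.75, so the estimate is 4 * 0.375.
print(estimate_cardinality(model, {"country": "DE", "tier": "gold"}))  # 1.5
```

The estimate of 1.5 differs from the true count of 1 because country and tier are correlated in the toy data. Capturing such correlations is what a learned model of the joint data distribution is meant to do.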
Benjamin's talk will be held as part of the LADSIOS workshop on Monday, August 16, at 11:20h (UTC+2); see the workshop program at https://www.ladsios.org/schedule.
LADSIOS 2021 is co-hosted with VLDB 2021.
- Benjamin Hilprecht (TU Darmstadt)