In the first submission cycle, 3 out of 3 papers were accepted to SIGMOD 2020

IDEBench: A new Benchmark for Interactive Data Exploration / DB4ML: An In-Memory Database Kernel with Machine Learning Support / DBPal: A Fully Pluggable NL2SQL Training Pipeline

2019/11/27

IDEBench: A new Benchmark for Interactive Data Exploration

In recent years, many new query processing techniques have been developed for relational databases to better support interactive data exploration (IDE) of large structured datasets. To evaluate and compare database engines with regard to how well they support IDE workloads, many recent papers typically rely on self-designed workloads rather than on standard benchmarks such as TPC-H or TPC-DS. The main reason for this is that the existing benchmarks were developed for classical reporting scenarios and do not enable to compare how well DBMS support the interactive exploration of large data sets. In this paper we therefore present a new benchmark called IDEBench. Driven by the findings of five user studies we have conducted in the last years, we derive the benchmark that evaluates database engines based on workloads derived from IDE scenarios. In order to demonstrate the applicability of IDEBench, we show a first experimental study that uses the benchmark to evaluate five different database engines, and present and discuss their performance.

DB4ML: An In-Memory Database Kernel with Machine Learning Support

Many interesting datasets, such as booking and business data, are still being stored and processed by relational databases. To gain business insights without expensive data copying, existing attempts to support machine learning (ML) within database systems are mainly based on extending the SQL query engine. However, modern parallel machine learning algorithms make use of fine-grained concurrent execution and relaxed consistency schemes, a behavior that cannot be mimicked by the SQL query engine. In this paper, we hence take a different route to support parallel ML algorithms within database systems. The key observation is that database transactions already enable a fine-grained concurrent execution model. As a main contribution, this paper presents DB4ML, an in-memory database kernel with a new execution scheme based on transactions to efficiently support iterative ML algorithms inside the DBMS. Our experimental evaluation shows that DB4ML can support ML algorithms inside a DBMS with the efficiency of modern parallel ML algorithms.

DBPal: A Fully Pluggable NL2SQL Training Pipeline

Natural language has long been a promising alternative query interface to databases that enables non-expert users to formulate complex questions in a more concise manner. Recently, deep learning techniques have gained traction as a way to translate natural language to SQL, since similar ideas have been successful in related domains (e.g., English to Spanish). However, the core problem with existing deep learning approaches is that they require an enormous amount of training examples in order to provide accurate translations. Such training data is extremely expensive to curate, since it generally requires humans to manually annotate natural language with SQL queries. Based on these observations, we propose DBPal, a new approach that augments existing deep learning techniques in order to improve the performance of natural language to SQL translation. More specifically, we present a novel training pipeline that automatically generates synthetic training data in order to improve translation accuracy and create a model that is tailor-made to the target database. As we show, our training pipeline applied to existing deep learning techniques is able to improve the accuracy of state-of-the-art natural language to SQL translation tasks.