Research Seminar

Our group hosts talks about new research projects and ideas. The talks are presented in our research seminar, which we currently conduct together with the Hasso Plattner Institute. A list of previous talks appears below:

Talks

Syndata – Neural Data Completion for Relational Databases – Benjamin Hilprecht (Technical University Darmstadt) – 08.02.2021

Abstract:

Classical approaches for OLAP assume that the data in all tables is complete. However, for incomplete tables with missing tuples, classical approaches fail because the result of a SQL aggregate query might differ significantly from the result computed on the full dataset. Today, the only way to deal with missing data is to complete the dataset manually, which not only takes considerable effort but also requires good statistical skills to determine when a dataset is actually complete.

In this paper, we propose an automated approach for relational data completion called Syndata using a new class of (neural) schema-structured completion models that are able to synthesize data which resembles the missing tuples. As we show in our evaluation, this efficiently helps to reduce the relative error of aggregate queries by up to 390% on real-world data compared to using the incomplete data directly for query answering.
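To make the error metric concrete, the following minimal Python sketch computes the relative error of a SUM aggregate on an incomplete table and on a completed one. The toy values, the `synthesized` list, and the `relative_error` helper are hypothetical illustrations; the synthesized tuples are hard-coded stand-ins and are not produced by Syndata's completion models.

```python
def relative_error(estimate: float, truth: float) -> float:
    """Relative error of an aggregate estimate with respect to the true value."""
    return abs(estimate - truth) / abs(truth)


# Hypothetical toy column: the last two tuples are treated as "missing".
full_amounts = [10.0, 20.0, 30.0, 40.0]
incomplete_amounts = full_amounts[:2]

true_sum = sum(full_amounts)
incomplete_sum = sum(incomplete_amounts)

# Hard-coded stand-ins for synthesized tuples (an assumption for illustration).
synthesized = [28.0, 37.0]
completed_sum = incomplete_sum + sum(synthesized)

err_incomplete = relative_error(incomplete_sum, true_sum)
err_completed = relative_error(completed_sum, true_sum)

print(f"incomplete: {err_incomplete:.2f}, completed: {err_completed:.2f}")
```

In this toy case the SUM over the incomplete column misses most of the true total, while adding tuples that merely resemble the missing ones brings the aggregate much closer.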

Are All Parameters Served Equally? Evaluating Parameter Server Efficiency in ML Systems – Ilin Tolovski (Hasso Plattner Institute) – 25.01.2021

Abstract:

Parameter servers (PS) are among the most frequently used solutions for model synchronization and storage in distributed machine learning setups. To guarantee the convergence of the trained model, they need to achieve efficient and timely parameter synchronization between workers. Multiple factors play a significant role in the performance and efficiency of parameter servers, such as the update and execution strategies, parameter locality, and the trade-off between computation and communication time. In this project, we inspect how these design decisions impact the overall performance of PS as a part of the machine learning pipeline. To this end, we propose a set of micro-benchmarks and end-to-end benchmarks that capture the performance differences between several state-of-the-art PS implementations. Our study includes both vanilla libraries (ps-lite, LAPSE) and machine learning frameworks (TensorFlow, MXNet) that incorporate PS as their synchronization structure for distributed training.
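For readers unfamiliar with the pattern, here is a minimal single-process Python sketch of a synchronous parameter server: workers pull the current parameters, push gradients, and the server applies the averaged update. The class `ToyParameterServer` and its methods are hypothetical names for illustration only; they do not reflect the APIs of ps-lite, LAPSE, TensorFlow, or MXNet, and the sketch ignores networking, sharding, and asynchronous update strategies.

```python
class ToyParameterServer:
    """Hypothetical in-memory parameter server with synchronous updates."""

    def __init__(self, params, lr):
        self.params = dict(params)  # parameter name -> value
        self.lr = lr                # learning rate
        self.pending = []           # gradients pushed since the last sync

    def pull(self):
        """Return a copy of the current parameters for a worker."""
        return dict(self.params)

    def push(self, grads):
        """Buffer one worker's gradients until the next synchronization."""
        self.pending.append(grads)

    def sync_step(self):
        """Apply the average of all buffered gradients, then clear the buffer."""
        n = len(self.pending)
        for key in self.params:
            avg = sum(g[key] for g in self.pending) / n
            self.params[key] -= self.lr * avg
        self.pending.clear()


server = ToyParameterServer({"w": 1.0}, lr=0.5)

# Two simulated workers pull the same parameters and push different gradients.
snapshot = server.pull()
server.push({"w": 2.0})
server.push({"w": 4.0})
server.sync_step()

print(snapshot["w"], server.params["w"])
```

The synchronous barrier in `sync_step` is one possible update strategy; it trades waiting time for consistency between workers, which is the kind of design decision the benchmarks above are meant to compare.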

Hiding data stalls with coroutine-oriented transaction execution – Tianzheng Wang (Simon Fraser, CA) – 14.12.2020

Abstract:

As the speed gap between memory and CPU continues to widen, memory accesses are becoming a major overhead in pointer-rich data structures, such as B-trees, hash tables, and linked lists, which are important building blocks of database systems. Software prefetching techniques have been proposed as an effective way to hide stalls through careful scheduling and interleaving that overlap data fetching and computation. Yet they require a vastly different multi-key interface, which breaks backward compatibility. It has also been unclear how these techniques could be applied in a full database engine.

In this talk, we will share our experience adopting software prefetching using recent C++20 coroutines in a full database engine. The crux is an asynchronous "coroutine-to-transaction" paradigm that departs from the traditional, synchronous "thread-to-transaction" execution model. Coroutine-to-transaction reduces database kernel changes and maintains backward compatibility, while retaining the performance benefits of software prefetching. With coroutine-to-transaction, we build CoroBase, a coroutine-oriented database engine. In the context of CoroBase, we further discuss the new execution model's impact on database engine design (such as concurrency control and resource management) and highlight interesting future work.
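The interleaving idea can be sketched with Python generators standing in for C++20 coroutines. Each lookup suspends at the point where a real engine would issue a prefetch, and a round-robin scheduler resumes another lookup in the meantime. This only mimics the control flow: Python cannot issue hardware prefetches, and the names `lookup` and `run_interleaved` are hypothetical and not taken from CoroBase.

```python
from collections import deque


def lookup(chain, key, trace):
    """Walk a chain of (key, value, next_index) nodes, suspending before each access."""
    idx = 0
    while idx is not None:
        # A real engine would prefetch node `idx` here and suspend the coroutine.
        yield
        trace.append((key, idx))  # record the order in which nodes are touched
        node_key, value, nxt = chain[idx]
        if node_key == key:
            return value
        idx = nxt
    return None


def run_interleaved(coroutines):
    """Round-robin scheduler: resume each unfinished coroutine in turn."""
    results = {}
    queue = deque(enumerate(coroutines))
    while queue:
        i, coro = queue.popleft()
        try:
            next(coro)
            queue.append((i, coro))
        except StopIteration as done:
            results[i] = done.value  # the generator's return value
    return [results[i] for i in range(len(coroutines))]


chain = [("a", 1, 1), ("b", 2, 2), ("c", 3, None)]
trace = []
results = run_interleaved([lookup(chain, "c", trace), lookup(chain, "b", trace)])

print(results)
print(trace)
```

The recorded trace alternates between the two lookups, which is the overlap that software prefetching relies on: while one lookup waits for its node, the other makes progress.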

Talk Recording

DPI: The Data Processing Interface for High-Speed Networks – Lasse Thostrup (Technical University Darmstadt) – 30.11.2020

Abstract:

In this talk, we propose the Data Processing Interface (DPI) as a way to make it easier for data processing systems to exploit high-speed networks without the need to deal with the complexity of RDMA. By lifting the level of abstraction, DPI factors out much of the complexity of network communication and makes it easier for developers to declaratively express how data should be efficiently routed to accomplish a given distributed data processing task. As we show in our experiments, DPI is able to support a wide variety of data-centric applications with high performance at a low complexity for the applications.

Efficient Integration of Persistent Memory into Key-Value Stores – Lawrence Benson (Hasso Plattner Institute) – 16.11.2020

Abstract:

In this talk, we present persistent memory (PMem) technologies as a promising alternative for persistence in key-value stores. First, we introduce PMem, emphasizing its promise of close-to-DRAM speed while also offering persistence. Afterwards, we elaborate on a novel approach to integrating PMem as a storage medium in key-value stores.