Extended Seminar - AI for Data Management

This seminar is about how AI can be used for data management. This year, the seminar focuses on two topics: learned DBMS components and AI for data engineering tasks. The course starts with a mini lecture series to provide the necessary background for the two practical tasks that follow.

  • Task 1 (Learned Cost Model): You will develop a learned cost model for a DBMS that predicts the execution cost of a given query. To support this, you will receive training data and stencil code that provides the necessary framework for building the learned cost model.
  • Task 2 (LLMs for Data Engineering): You will investigate how well LLMs can solve classical data engineering tasks. Based on existing literature, each student selects one particular data engineering task and evaluates how well an LLM can solve it.

Organization

Last offered: Winter Semester (24/25)
Lecturers: Prof. Carsten Binnig, Dr. Manisha Luthra, Dr. Anupam Sanghi
Assistants: Roman Heinrich, Eduardo Reis, Jan-Micha Bodensohn, Liane Vogel
Examination: See Moodle
The kickoff meeting will take place on October 15, 2024, from 09:50 to 11:30 in room S207/167.

Course Information

Below you will find general information about the seminar. For all information specific to this year’s seminar (including important dates), please check the Moodle course linked above. Also make sure that you are registered in TUCaN.

Prerequisites:

You should have basic knowledge of machine learning and Python programming. Advanced knowledge of data management and database systems from courses such as SDMS or ADMS, as well as from machine learning courses, is also helpful.

Seminar Topic:

Database management systems (DBMS) in the cloud are the backbone for managing large volumes of data efficiently and thus play a central role in business and science today. To provide high performance, many of the most complex DBMS components, such as query optimizers or schedulers, must solve non-trivial problems.

To tackle such problems, recent work has outlined a new direction of so-called learned DBMS components, in which AI-based methods replace or enhance core DBMS components. This approach has been shown to provide significant performance benefits. The direction is particularly interesting because cloud vendors such as Google, Amazon, and Microsoft are already applying these techniques to optimize the performance of their cloud data systems.

Besides learned DBMS components, AI has been used to improve many other data management tasks. For example, classical data engineering tasks such as error detection, missing value imputation, and data augmentation typically require substantial manual effort and can be automated with AI. Finally, AI has also been used to extend databases through better data access interfaces (e.g., natural language querying and chatbots for data) or by supporting data beyond structured tabular data (e.g., text and images).

This seminar is designed to introduce students to the foundational concepts of using AI for data management. It includes a mini lecture series that provides the necessary background and prepares students for the seminar tasks. The seminar is divided into two parts, each focusing on one of the themes introduced above: learned DBMS components and the application of AI to data engineering. Students will work on a practical task for each part.