Jan-Micha Bodensohn M.Sc.
Doctoral Researcher
Working area(s)
Machine Learning for Data Engineering
Contact
jan-micha.bodensohn@tu-...
Work
S2|02 D111
Hochschulstr. 10
64289 Darmstadt
I am a doctoral student supervised by Prof. Carsten Binnig at the Data and AI Systems Lab of the Technical University of Darmstadt. My research aims to fully automate data engineering through Large Language Models (LLMs), from reasoning about the required data to transforming it for queries and applications.
I joined the Data and AI Systems Lab as a doctoral student in 2023 after completing my Bachelor's and Master's degrees in Computer Science at TU Darmstadt. From 2023 to 2025, I was also employed as a researcher at the German Research Center for Artificial Intelligence (DFKI). Prior to my doctoral studies, I worked as a student research assistant at the Data Management Lab at TU Darmstadt.
I currently explore LLMs for data engineering in enterprise settings and work towards overcoming their shortcomings on tabular data, such as their high costs and limitations in handling large data lakes.
Highlight projects:
Executing SQL queries over data lakes is challenging because of the disorganized data and lack of relational schemas. Our approach R2D2 addresses this by automatically constructing small relational databases tailored to individual queries. As a second step, we use LLM agents to iteratively rewrite the SQL queries to run directly on the data lake tables. Our end goal is to make data lakes as easy to query as relational databases.
aiDM’24 DASP’25 NOVAS’25
Using LLMs for data engineering in enterprise settings is much harder than current research on public benchmarks makes it look. We show that enterprise-specific challenges like the much larger scale of the data, the higher complexity of real-world scenarios, and the need for domain-specific knowledge can reduce performance by as much as 80%. We are now working on bridging this gap between research and industry.
VLDB’26 TRL’24 TaDA’24 (Best Short Paper)
WikiDBs is a large-scale corpus of 100k relational databases crawled from Wikidata. It enables the training of multi-table foundation models as well as the evaluation of table retrieval and integration approaches.
NeurIPS’24 (Spotlight Paper)