Our Paper “WannaDB: Ad-hoc SQL Queries over Text Collections” was accepted at BTW'23
20th Conference on “Database Systems for Business, Technology and Web” (BTW2023)
2023/01/16
In this paper, we propose a new system called WannaDB that allows users to interactively perform structured explorations of text collections in an ad-hoc manner. Extracting structured data from text is a classical problem for which a plenitude of approaches and even industry-scale systems already exist. However, these approaches lack the ability to support the ad-hoc exploration of texts using structured queries. The main idea of WannaDB is to include user interaction to support ad-hoc SQL queries over text collections using a new two-phased approach. First, a superset of information nuggets is extracted from the texts using existing extractors such as named entity recognizers. Then, based on embeddings, the extractions are interactively matched to a structured table definition as requested by the user. In our evaluation, we show that WannaDB is thus able to extract high-quality structured data from a broad range of (real-world) text collections without the need to design extraction pipelines upfront.
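To make the two-phased idea more concrete, here is a minimal sketch in Python. It is not WannaDB's implementation: the `Nugget` class, the extractor labels, and the character-trigram `toy_embed` function are assumptions made purely for illustration, standing in for real extractors and learned embeddings. The sketch also leaves out the interactive feedback loop entirely and simply picks the closest nugget for each requested attribute.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Nugget:
    """One extracted span plus the type label a (hypothetical) extractor assigned."""
    text: str
    label: str


def toy_embed(s: str) -> Counter:
    """Toy stand-in for a learned embedding: counts of character trigrams."""
    padded = f"  {s.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_nuggets(nuggets: list[Nugget], attributes: list[str]) -> dict:
    """Phase 2 (simplified): fill each requested attribute with the nugget
    whose label embedding is closest to the attribute name."""
    row = {}
    for attr in attributes:
        attr_vec = toy_embed(attr)
        best = max(
            nuggets,
            key=lambda n: cosine(toy_embed(n.label), attr_vec),
            default=None,
        )
        row[attr] = best.text if best is not None else None
    return row


# Phase 1 is assumed to have already produced these nuggets for one document.
nuggets = [
    Nugget("Darmstadt", "city location"),
    Nugget("2023-01-16", "date"),
    Nugget("Carsten Binnig", "person name"),
]

# The user asks for a table with these columns; one row per document.
row = match_nuggets(nuggets, ["date", "person", "city"])
print(row)
```

In a real system, the user would confirm or correct such matches, and that feedback would refine the matching for the remaining documents.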
We will present this paper at the 20th Conference on “Database Systems for Business, Technology and Web” (BTW2023).
The BTW conference is the most important database conference in the German-speaking area. Every two years since 1985, it has served as a central forum for the exchange of information between scientists, practitioners, and users on topics of database and information system technology.
The results of this paper are reproducible, and it earned the reproducibility badge. You can find the code at https://link.tuda.systems/wannadb
Authors:
- Benjamin Hättasch (TU Darmstadt)
- Jan-Micha Bodensohn (TU Darmstadt)
- Liane Vogel (TU Darmstadt)
- Matthias Urban (TU Darmstadt)
- Carsten Binnig (TU Darmstadt)