Paper accepted to EDBT 2024

Pythagoras: Semantic Type Detection of Numerical Data in Enterprise Data Lakes


Authors: Sven Langenecker, Christoph Sturm, Christian Schalles, Carsten Binnig

The paper introduces Pythagoras, a semantic type detection approach for data lake tables that consist predominantly of numerical data. Pythagoras converts each table into a dedicated graph representation and applies a Graph Neural Network (GNN) to it, which lets it predict the semantic types of numerical data with high accuracy. In the experiments, Pythagoras outperforms five state-of-the-art approaches and improves the F1-Score by around 22% over the best existing method, setting a new benchmark for semantic type detection of numerical data in data lakes.
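
To make the idea of a graph representation of a table more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the node kinds, the choice of summary statistics (min, max, mean, standard deviation), the edge direction, and the names `is_numeric` and `table_to_graph` are all assumptions made for illustration. The sketch only shows how a table could be turned into nodes and edges that a GNN might later consume.

```python
from statistics import mean, pstdev


def is_numeric(values):
    """Treat a column as numerical if it is non-empty and holds only ints/floats."""
    return bool(values) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def table_to_graph(table_name, columns):
    """Build a toy graph for one table (hypothetical layout, not the paper's).

    Nodes: one for the table name and one per column.
    Edges: every context node (table name, non-numerical column) points to
    every numerical column node, so that context can flow to the columns
    whose semantic type is to be predicted.
    """
    nodes = {"table": {"kind": "table_name", "text": table_name}}
    numeric_ids, context_ids = [], ["table"]

    for name, values in columns.items():
        node_id = f"col:{name}"
        if is_numeric(values):
            # Assumed feature set: simple summary statistics of the values.
            nodes[node_id] = {
                "kind": "numerical_column",
                "features": [min(values), max(values), mean(values), pstdev(values)],
            }
            numeric_ids.append(node_id)
        else:
            nodes[node_id] = {
                "kind": "text_column",
                "text": " ".join(str(v) for v in values),
            }
            context_ids.append(node_id)

    edges = [(src, dst) for dst in numeric_ids for src in context_ids]
    return nodes, edges


# Example table with one text column and two numerical columns.
nodes, edges = table_to_graph(
    "employees",
    {"name": ["Ann", "Bob"], "age": [31, 45], "salary": [52000.0, 61000.0]},
)
```

In this sketch, `nodes` maps node ids to their attributes and `edges` is a list of directed `(source, target)` pairs. A real pipeline would additionally need to encode the text nodes as vectors and hand the graph to a GNN library, which is outside the scope of this example.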