
Publication

Pythagoras: Semantic Type Detection of Numerical Data Using Graph Neural Networks

Sven Langenecker; Christoph Sturm; Christian Schalles; Carsten Binnig
In: Michael Leyer; Johannes Wichmann (Eds.). Lernen, Wissen, Daten, Analysen (LWDA) Conference Proceedings, Marburg, Germany, October 9-11, 2023. GI-Workshop-Tage "Lernen, Wissen, Daten, Analysen" (LWDA), Pages 146-152, CEUR Workshop Proceedings, Vol. 3630, CEUR-WS.org, 2023.

Abstract

Detecting semantic types of table columns is a crucial task to enable dataset discovery in data lakes. However, prior semantic type detection approaches have primarily focused on non-numerical data despite the fact that numerical data play an essential role in many real-world enterprise data lakes. Therefore, existing models are typically rather inadequate when applied to data lakes that contain a high proportion of numerical data. In this paper, we introduce Pythagoras, our new learned semantic type detection approach specially designed to support numerical along with non-numerical data. Pythagoras uses a GNN in combination with a novel graph representation of tables to predict the semantic types for numerical data with high accuracy. In our experiments, we compare Pythagoras against five state-of-the-art approaches using two different datasets and show that our model significantly outperforms these baselines on numerical data. In comparison to the best existing approach, we achieve F1-Score increases of around +22%, which sets new benchmarks.
