DB4ML

An In-Memory Database Kernel with Machine Learning Support

According to a recent Kaggle survey, relational data is the type of data most commonly used by data science teams. It typically serves as input for classical machine learning models such as regression models and decision trees, or for unsupervised algorithms such as PageRank or clustering.
The standard approach for applying machine learning (ML) algorithms on relational data is to first select the relevant entries with an SQL query and export them from the database into an external ML tool.
The ML algorithm is then run over the extracted data, outside the DBMS, using statistical software packages or ML libraries such as R, SPSS, or scikit-learn.
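
The export-then-train workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `houses` table, its columns, the `export_and_fit` helper, and the use of an in-memory SQLite database with a hand-written least-squares fit in place of a production DBMS and an external ML library.

```python
import sqlite3


def export_and_fit(conn, min_size):
    """Select the relevant rows with SQL, copy them out of the database,
    and fit a one-variable linear regression outside the DBMS."""
    # Step 1: select the relevant entries with an SQL query.
    rows = conn.execute(
        "SELECT size, price FROM houses WHERE size >= ?", (min_size,)
    ).fetchall()

    # Step 2: export the result into plain Python lists (the data copy).
    xs = [row[0] for row in rows]
    ys = [row[1] for row in rows]

    # Step 3: run the ML algorithm over the extracted data
    # (ordinary least squares for a single feature).
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical example data: price is exactly 2 * size for the selected rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (size REAL, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?)",
    [(50, 100), (60, 120), (80, 160), (100, 200), (20, 500)],
)
slope, intercept = export_and_fit(conn, min_size=50)
```

Even in this small sketch, every selected row leaves the database before training starts, which is the copy the project aims to avoid.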

In this project, we revisit the question of how ML algorithms can best be integrated into existing DBMSs, not only to avoid expensive data copies to external ML tools but also to comply with regulatory requirements. The key observation is that database transactions already provide an execution model that allows DBMSs to efficiently mimic the execution model of modern parallel ML algorithms.

We define a programming model for user-defined iterative transactions that supports a wide class of ML algorithms and allows developers to easily integrate new ML algorithms into a DBMS.
We provide the implementation of our transactional database kernel called DB4ML including a storage manager and execution engine that can efficiently run parallel ML algorithms.
Finally, we provide an evaluation of DB4ML in our recent paper based on two popular use cases, namely PageRank and stochastic gradient descent (SGD).
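
To give a feel for what a user-defined iterative transaction might look like, the following sketch runs PageRank as a set of small update functions applied repeatedly to shared state until the values stop changing. It is not DB4ML's actual interface: the names `run_iterative` and `make_pagerank_update`, the dictionary used as a stand-in for stored tuples, and the sequential driver loop are all assumptions chosen to keep the example short and deterministic.

```python
from typing import Callable, Dict, List

Store = Dict[int, float]  # hypothetical stand-in for one column of stored tuples


def run_iterative(
    store: Store,
    keys: List[int],
    update: Callable[[Store, int], float],
    tol: float = 1e-9,
    max_rounds: int = 1000,
) -> int:
    """Apply update(store, key) to every key, round after round, until no
    value changes by more than tol. Returns the number of rounds executed."""
    for round_no in range(1, max_rounds + 1):
        max_delta = 0.0
        for key in keys:
            new_value = update(store, key)
            max_delta = max(max_delta, abs(new_value - store[key]))
            # The write is immediately visible to later updates in this round.
            store[key] = new_value
        if max_delta <= tol:
            return round_no
    return max_rounds


def make_pagerank_update(
    in_links: Dict[int, List[int]],
    out_degree: Dict[int, int],
    damping: float = 0.85,
) -> Callable[[Store, int], float]:
    """Build the per-node update: read the ranks of all nodes linking to
    this node and compute the node's new rank."""
    n = len(out_degree)

    def update(store: Store, node: int) -> float:
        incoming = sum(store[src] / out_degree[src] for src in in_links[node])
        return (1.0 - damping) / n + damping * incoming

    return update


# Hypothetical 3-node graph with edges 0->1, 0->2, 1->2, 2->0.
in_links = {0: [2], 1: [0], 2: [0, 1]}
out_degree = {0: 2, 1: 1, 2: 1}
ranks = {node: 1.0 / 3.0 for node in in_links}
rounds = run_iterative(ranks, list(in_links), make_pagerank_update(in_links, out_degree))
```

The developer supplies only the per-node update, while the driver owns iteration and convergence checking. In a real kernel the driver would presumably execute these updates in parallel under transactional isolation; the sequential loop here only illustrates the shape of the abstraction.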

The main result of this project will be an easy-to-use abstraction that allows database users to implement their own machine learning algorithms in a classical relational database.