Data preprocessing plays a crucial role in most data science applications. It covers the cleaning, transformation, and general manipulation of datasets to produce a data format suitable for downstream machine learning models. Preprocessing data for machine learning is often an iterative process in which the right steps are found through trial and error. In this project, we aim to support data scientists and domain experts in building their preprocessing pipelines. To this end, we develop tools that make the changes introduced by individual steps explicit, which helps in understanding how preprocessing changes the data. We also plan to develop approaches that visualize how different preprocessing configurations affect the model trained on the processed data.
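To illustrate what making the changes of a single step explicit could look like, the following is a minimal sketch in plain Python. It is not the project's tool or its API: the `Table` representation, the helpers `profile`, `explain_step`, and `impute_mean`, and the sample data are all assumptions made up for this example. The idea is simply to profile the data before and after one step and report the differences.

```python
from statistics import mean

# Assumed toy representation: column name -> list of values, None = missing.
Table = dict[str, list]


def profile(table: Table) -> dict[str, dict]:
    """Per-column summary: number of missing and of distinct non-missing values."""
    return {
        name: {
            "missing": sum(v is None for v in values),
            "distinct": len({v for v in values if v is not None}),
        }
        for name, values in table.items()
    }


def explain_step(before: Table, after: Table) -> dict:
    """Report what one preprocessing step changed (hypothetical helper)."""
    p_before, p_after = profile(before), profile(after)
    shared = [c for c in before if c in after]
    return {
        "columns_added": [c for c in after if c not in before],
        "columns_removed": [c for c in before if c not in after],
        # Only columns whose summary differs are listed, with both summaries.
        "columns_changed": {
            c: {"before": p_before[c], "after": p_after[c]}
            for c in shared
            if p_before[c] != p_after[c]
        },
    }


def impute_mean(table: Table, column: str) -> Table:
    """Example step: replace missing values in a numeric column by its mean."""
    present = [v for v in table[column] if v is not None]
    fill = mean(present)
    return {**table, column: [fill if v is None else v for v in table[column]]}


raw = {"age": [25.0, None, 40.0], "city": ["Ulm", "Bonn", None]}
report = explain_step(raw, impute_mean(raw, "age"))
print(report)
```

In this sketch the report lists only `age` as changed, since the imputation step leaves `city` untouched. A real tool would likely track richer statistics, such as value distributions or data types, but the before/after comparison per step is the core idea.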
Git-Repository: https://github.com/bastistrasser/hawk
Publication: Sebastian Strasser, Meike Klettke: Transparent Data Preprocessing for Machine Learning, HILDA@SIGMOD, 2024 (DOI)