Data Provenance and Reproducible ML Pipelines in Python

Data provenance is the practice of recording every change to input data, feature transformations, code, model parameters, and external dependencies. With that record, it’s possible to reproduce exactly the output of your models as it was at any point in the past and to trace how that output was produced. This is vital for debugging misbehaving models in production and for explaining why a model made a particular prediction. These two properties, ML reproducibility and explainability, are requirements in regulated areas such as automated stock trading and medicine.
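As a rough illustration of what such a record might contain, here is a minimal sketch using only the Python standard library. It is not the tooling covered in the talk; the function names, the field names, and the sample data are all hypothetical, chosen only to mirror the kinds of information listed above.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw input bytes."""
    return hashlib.sha256(data).hexdigest()


def build_provenance_record(input_data: bytes, params: dict, code_version: str) -> dict:
    """Collect the facts needed to identify one pipeline run (hypothetical schema)."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "input_sha256": fingerprint(input_data),      # identifies the exact input data
        "params": params,                             # model parameters used for the run
        "code_version": code_version,                 # e.g. a commit hash, supplied by the caller
        "python_version": platform.python_version(),  # one external dependency among many
    }


# Hypothetical sample input and parameters, for illustration only.
record = build_provenance_record(
    input_data=b"price,volume\n101.5,2000\n",
    params={"learning_rate": 0.01, "seed": 42},
    code_version="abc1234",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Hashing the input rather than copying it keeps the record small while still making any change to the data detectable. A real pipeline would also need to store the data itself somewhere retrievable in order to rerun it.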

Python is one of the most widely used languages for data science, and it has many open source libraries and tools for implementing data provenance in ML pipelines.

In this talk, I will show you how to add end-to-end data provenance and reproducibility to your Python ML pipelines by applying these open source tools to a real-world ML pipeline. This covers recording every stage: input data collection, transformation, model training, and production model prediction.

Finally, I will show you how to use this recorded data to analyse the evolution of an ML pipeline over time and to reproduce the output of this pipeline as it was at any point in its history.
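One simple way to picture this kind of analysis is to compare the records of two runs and list what changed between them. The sketch below assumes each run was saved as a flat dictionary; the field names and values are hypothetical and are not taken from the pipeline shown in the talk.

```python
def diff_records(old: dict, new: dict) -> dict:
    """Map each field that differs between two runs to its (old, new) values."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Two hypothetical runs: same input data and parameters, different code version.
run_a = {"input_sha256": "aaa", "code_version": "abc1234", "params": {"seed": 42}}
run_b = {"input_sha256": "aaa", "code_version": "def5678", "params": {"seed": 42}}

changes = diff_records(run_a, run_b)
print(changes)  # {'code_version': ('abc1234', 'def5678')}
```

If two runs produce different predictions and the only difference in their records is the code version, the code change is the first place to look.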

Donald Whyte

Engineers Gate

A senior software engineer at Engineers Gate, a New York-based quantitative hedge fund
At Engineers Gate he builds real-time trading systems and large-scale data pipelines
An avid Python and Rust developer and data enthusiast
Organised hackathons in several countries
Previously worked at Bloomberg L.P., where he built core, high-performance database infrastructure that's still used across the firm globally