PyCon 2019 | Machine learning model and dataset versioning practices

Speaker: Dmitry Petrov

 

Python is a prevalent programming language in the machine learning (ML) community. Many Python engineers and data scientists feel the absence of engineering practices such as versioning large datasets and ML models, and the resulting lack of reproducibility. This gap is particularly acute for engineers who have just moved into the ML space.

We will discuss current practices for organizing ML projects with traditional open-source tools such as Git and Git-LFS, as well as the limitations of this toolset. These limitations motivate the development of new ML-specific version control systems.

Data Version Control, or [DVC.ORG][1], is an [open source][2] command-line tool written in Python. We will show how to version ML models and datasets containing dozens of gigabytes of data, how to use your favorite storage (S3, GCS, or a bare-metal SSH server) as a data file backend, and how to embrace the best engineering practices in your ML projects.
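To make the idea of versioning large files alongside Git more concrete, here is a minimal sketch of one common approach: keep the heavy file in a content-addressed cache and commit only a small pointer file. This is an illustration under stated assumptions, not DVC's actual implementation or file format; the function names (`track`, `checkout`), the `.ptr.json` suffix, and the local `cache` directory standing in for remote storage are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large dataset never has to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into the cache under its hash and write a pointer file.

    The pointer file is tiny, so it is the part a Git repository would track.
    (Hypothetical format, chosen only for this sketch.)
    """
    digest = file_digest(data_file)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def checkout(pointer: Path, cache_dir: Path) -> Path:
    """Restore the data file that a pointer file refers to from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


def demo() -> tuple:
    """Track two versions of a file, then restore the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "train.csv"

        data.write_bytes(b"x,y\n1,2\n")
        pointer = track(data, cache)
        v1_pointer_text = pointer.read_text()  # what a first commit would hold

        data.write_bytes(b"x,y\n1,2\n3,4\n")
        track(data, cache)  # pointer now refers to the second version

        # Simulate checking out the older commit: restore the old pointer,
        # then pull the matching data out of the cache.
        pointer.write_text(v1_pointer_text)
        restored = checkout(pointer, cache).read_bytes()
        return restored, len(list(cache.iterdir()))
```

Calling `demo()` tracks two versions of a small CSV file and restores the first one from the cache. In a real setup the cache would be synchronized with a storage backend such as those named above, while only the pointer files travel through Git.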

[1]: http://dvc.org

[2]: https://github.com/iterative/dvc

Slides can be found at: https://speakerdeck.com/pycon2019 and https://github.com/PyCon/2019-slides
