“Branches are all you need: data versioning framework for machine learning”.
If you keep binary files in git, even a small change gives you a completely new copy of the file. After a couple of changes, you quickly end up with a huge repository. That is especially a problem when you do machine learning.
A solution could be an object store (like Amazon S3). You could name directories after versions, for instance. But it quickly becomes a mess. Oh, and which version in the object store matches the versioned model parameters in git? Aargh.
What is proper data versioning? The answer is git. That’s the only solution to keep track of everything. The core is to use branches. The branches effectively contain links to files stored in object storage. There are tools for it like MLflow. You tell MLflow to upload/download the data, based on the config you keep in git. An alternative is git lfs for large files.
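To make the "links to files in object storage" idea a bit more concrete, here is a minimal sketch in Python. It is not how MLflow or git lfs actually work internally: the JSON pointer format, the function names and the example URL are all my own assumptions, purely for illustration. The idea is that only the small pointer file gets committed to the git branch, while the big file itself lives in the object store.

```python
import hashlib
import json
from pathlib import Path


def write_pointer(data_file: Path, pointer_file: Path, remote_url: str) -> dict:
    """Write a small JSON pointer describing data_file.

    The pointer format (sha256, size, url) is a hypothetical one,
    made up for this sketch. Only the pointer would be committed to git.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    pointer = {
        "sha256": digest,
        "size": data_file.stat().st_size,
        "url": remote_url,  # where the real file is assumed to live
    }
    pointer_file.write_text(json.dumps(pointer, indent=2, sort_keys=True) + "\n")
    return pointer


def matches_pointer(data_file: Path, pointer_file: Path) -> bool:
    """Check that a local copy of the data matches the committed pointer."""
    pointer = json.loads(pointer_file.read_text())
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    return digest == pointer["sha256"]
```

Because the pointer contains a content hash, checking out a branch tells you exactly which version of the data belongs with the code and parameters on that branch.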
The main branch is for the readme, the documentation, definition of the business problem, onboarding information. There’s no data or code in here.
Data branches. First raw: data first ends up here and is never deleted. Branches point at specific versions/collections.
Development branches. This is a combination of code and data. But don’t change the data, only the code. Make sure you’re only developing in a dev branch, not in a data branch: you want to keep the two activities separated.
When finished, you can tag what you have.
Stable branches. For (re-)training and running tests.
Analysis branch. Mostly for comparing models, checking algorithms.
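The "only change code in a dev branch, not data" rule is something you could check automatically. Below is a small sketch in Python of such a check. The talk did not prescribe any naming convention: the "dev/" branch prefix and the "data" directory are assumptions of mine, and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed layout for this sketch: dev branches are named "dev/...",
# and data pointers live in a top-level "data" directory.
DEV_PREFIX = "dev/"
DATA_DIR = "data"


def data_changes_on_dev(branch: str, changed_paths: list[str]) -> list[str]:
    """Return the changed paths under DATA_DIR if branch is a dev branch.

    An empty list means the commit respects the "code only" rule
    (or the branch is not a dev branch at all).
    """
    if not branch.startswith(DEV_PREFIX):
        return []
    return [
        path
        for path in changed_paths
        if PurePosixPath(path).parts[:1] == (DATA_DIR,)
    ]
```

You could feed it the branch name and the list of changed files in a pre-commit hook or a CI job, and fail the build when the returned list is not empty.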
He has a demo at https://xethub.com/sdssio/branches-demo .
My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.