Reputation: 42100
I come from a Java background and am completely new to Python.
I now have a Python project that consists of a few Python scripts and pickle files, all stored in Git. The pickle files are serialized sklearn models.
I wonder how to organize this project. I think we should not store the pickle files in Git. We should probably store them as binary dependencies somewhere.
Does that make sense? What is a common way to store binary dependencies of Python projects?
Upvotes: 7
Views: 4047
Reputation: 11489
Git is just fine with binary data. For example, many projects store images in Git repos.
I guess the rule of thumb is to decide whether your binary files are source material, an external dependency, or an intermediate build artifact. Of course, there are no strict rules, so go with whatever feels right for them. Here are my suggestions:
If they're (reproducibly) generated from something, .gitignore the binaries and have scripts that build the necessary data. Those scripts could live in the same repo or in a separate one, depending on where it feels best.
The same logic applies if they're obtained from some external source, e.g. an external download. Usually, we don't store dependencies in the repository - we only keep references to them. E.g. we don't keep virtualenvs but only have a requirements.txt file - a rough Java analogy is not committing .jars but only pom.xml or the dependencies section in build.gradle.
If they can be considered source material, e.g. if you edit them by hand and Python is effectively your editor - don't worry about the files' binary nature and just keep them in your repository.
If they aren't really source material, but their generation process is really complicated or takes very long, and the files aren't meant to be updated on a regular basis - I think it won't be terribly wrong to have them in the repo. Leaving a note (README.txt or something) about how the files were produced would be a good idea, of course.
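To make the "generate it, don't commit it" option a bit more concrete, here is a minimal sketch of a build script. Everything named here is hypothetical: the artifacts/ directory (which you would add to .gitignore), the function names, and train_model, which is only a stand-in for your real training step (e.g. fitting an sklearn estimator). It uses only the standard library so it runs as-is.

```python
"""Rebuild a model artifact from its inputs instead of committing the pickle."""
import pickle
from pathlib import Path

# Hypothetical output location; list "artifacts/" in .gitignore.
ARTIFACT_DIR = Path("artifacts")


def train_model(samples):
    """Stand-in for real training (assumption: yours would fit an estimator)."""
    # The "model" here is just the per-column mean of the samples.
    columns = list(zip(*samples))
    return {"means": [sum(col) / len(col) for col in columns]}


def build_artifact(samples, name="model.pkl", out_dir=ARTIFACT_DIR):
    """Train the model and write it to out_dir/name; return the written path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / name
    with path.open("wb") as fh:
        pickle.dump(train_model(samples), fh)
    return path


def load_artifact(path):
    """Load a previously built artifact. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    built = build_artifact([(1.0, 2.0), (3.0, 4.0)])
    print(load_artifact(built))
```

With something like this in the repo, a fresh clone only needs the scripts and the training inputs; the pickle is recreated on demand rather than versioned.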
Oh, and if the files are large (like, hundreds of megabytes or more), consider taking a look at git-lfs.
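If you're not sure whether anything in the project is big enough to bother, a quick scan like the sketch below can tell you. The function name and the 100 MiB default threshold are my own assumptions (taken loosely from "hundreds of megabytes"), so adjust to taste. Whatever it reports would be a candidate for something like git lfs track "*.pkl".

```python
from pathlib import Path


def find_large_files(root, threshold_bytes=100 * 1024 * 1024):
    """Return (path, size) pairs for files under root at or above the threshold."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Skip Git's own object store and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size >= threshold_bytes:
            hits.append((path, size))
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_large_files("."):
        print(f"{size / 1024 / 1024:8.1f} MiB  {path}")
```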
Upvotes: 8