Reputation: 45941
I have one (for now) large text data file of 120 MB.
Is it a poor practice to put it in the repo? Does it affect search functionality on GitHub?
It seems like it is a bad idea because the entire source code is only 900 lines.
I am not planning on updating the file.
I could put it on Dropbox or Google Docs, but then it would be separate from the repo.
If not GitHub, is there a better way of managing/backing up large data files?
Upvotes: 24
Views: 19374
Reputation: 1987
There are good ways to handle this situation. For example, when I am working on a project that analyses data, especially after cleaning and preprocessing steps, it's unhelpful to share the code but not the data set (within reason, of course, given the size of the data set). Here is what I have found:
git lfs (Large File Storage) lets you track, commit, and push binaries, data files, images, etc. to the same remote, and you don't have to pull everything when you clone the repo.
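A minimal sketch of the git lfs workflow, assuming a hypothetical data file at data/large-corpus.txt and a remote branch named main (adjust the paths and branch to your project):

```
# Install the LFS hooks once per machine.
git lfs install

# Tell LFS which paths to manage; the rule is recorded in .gitattributes.
git lfs track "data/*.txt"
git add .gitattributes

# Commit and push as usual; the file content goes to LFS storage,
# while the repo itself only stores a small pointer.
git add data/large-corpus.txt
git commit -m "Add large data file via Git LFS"
git push origin main
```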
git-annex uses its own commands, so you commit the repo and the annexed files separately. It looks great for managing these files on any remote, such as a hard drive, S3, Google Drive, and many more.
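A rough sketch of the git-annex flow, again with the made-up data/large-corpus.txt path, and assuming a special remote has already been set up with `git annex initremote` (the remote name below is a placeholder):

```
# Turn the existing repo into an annex and add the large file to it.
git annex init "my laptop"
git annex add data/large-corpus.txt
git commit -m "Add large data file to the annex"

# Push the file content to the configured special remote.
git annex copy data/large-corpus.txt --to my-s3-remote
```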
Someone has made a nice comparison of git-annex vs git lfs here, and this post compares several methods in short form.
Both seem great; git-annex is currently more mature, but git lfs is developed by GitHub, which is what I use, so I am going with git lfs.
Upvotes: 7
Reputation: 7279
Is pages.github.com the correct place? No.
GitHub Help answers this question very clearly (I was looking for this too):
https://help.github.com/articles/what-is-my-disk-quota
Large media files
Binary media files do not get along very well with Git. For these files it's usually best to use a service specifically designed for what you're using.
For large media files like video and music, you should host the files yourself or use a service like Vimeo or YouTube.
For design files like PSDs and 3D models, a service like Dropbox usually works quite nicely. This is what GitHub's designers use to stay in sync; only final image assets are committed into our repos.
and https://help.github.com//articles/distributing-large-binaries
Upvotes: 2
Reputation: 8172
If the file does not need to be under version control, then I would be reluctant to place it on GitHub.
Update based on discussions ...
From http://git-scm.com/book/en/Customizing-Git-Git-Hooks
After you run a successful git checkout, the post-checkout hook runs; you can use it to set up your working directory properly for your project environment. This may mean moving in large binary files that you don’t want source controlled, auto-generating documentation, or something along those lines.
So using this mechanism you could download the externally stored data file to your working copy.
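A minimal sketch of such a hook, assuming the file is hosted at a placeholder URL and lives at a made-up path data/large-corpus.txt (point both at your actual setup):

```
#!/bin/sh
# .git/hooks/post-checkout -- make it executable: chmod +x .git/hooks/post-checkout
# Downloads the large data file if it is not already in the working copy.
# DATA_FILE and DATA_URL are placeholders, not values from the original answer.

DATA_FILE="data/large-corpus.txt"
DATA_URL="https://example.com/large-corpus.txt"

if [ ! -f "$DATA_FILE" ]; then
    mkdir -p "$(dirname "$DATA_FILE")"
    curl -L -o "$DATA_FILE" "$DATA_URL"
fi
```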
Upvotes: 3
Reputation: 19722
Put it in the repo if:
1- you want to keep track of the changes
2- it is actually a part of the project and you want people to receive it when they clone the repo
Don't put it in the repo (use .gitignore to exclude it; see the sketch below) if:
1- it changes often, but the changes are not meaningful and you don't want to keep the history
2- it is available online, or you can make it available online and put a link in the repo so people know where to find it
Dropbox is fine if you don't have lots of people downloading it; Amazon S3 is your best bet for hosting it.
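If you go the .gitignore route, a minimal sketch might look like this (the file name is a placeholder):

```
# .gitignore -- keep the large data file out of version control
data/large-corpus.txt
```

You would then note in the README where to download the file (e.g. an S3 bucket URL) and where to place it in the working copy.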
Upvotes: 7
Reputation: 129762
You can put it on GitHub, but I would recommend putting it in another repository and linking to it via submodules. This ensures that the file is not transferred or changed unless you explicitly do so via the submodule command.
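A short sketch of that setup, using placeholder repository URLs and a made-up data/ path:

```
# In the main project, add a separate data-only repository as a submodule.
git submodule add https://github.com/your-user/project-data.git data
git commit -m "Add data repository as a submodule"

# People who clone the main project fetch the data only when they ask for it.
git clone https://github.com/your-user/project.git
cd project
git submodule update --init   # explicitly pulls the data submodule
```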
Upvotes: 3