B Seven

Reputation: 45941

How to manage large data files with GitHub?

I have one (for now) large text data file of 120 MB.

Is it a poor practice to put it in the repo? Does it affect search functionality on GitHub?

It seems like it is a bad idea because the entire source code is only 900 lines.

Not planning on updating the file.

I could put it on Dropbox or Google Docs, but then it would be separate from the repo.

If not GitHub, is there a better way of managing/backing up large data files?

Upvotes: 24

Views: 19374

Answers (5)

Merlin

Reputation: 1987

There are good ways to handle this situation. For example, when I am working on a project that analyses data, especially after cleaning and preprocessing steps, it's lame to share the code but not the data set (within reason, of course, depending on the size of the data set). Here is what I have found:

  • git lfs (Git Large File Storage) lets you track, commit, and push binaries, data files, images, etc. to the same remote, and you don't have to pull everything when you clone the repo (see the sketch after this list).

  • git-annex uses its own commands, so you will be committing the repo and the annexed files separately. It looks great for managing these files on any remote, such as a hard drive, S3, Google Drive, and many more.
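
A minimal sketch of the git lfs workflow, assuming git-lfs is installed and the data file lives at data/measurements.txt (a hypothetical path):

```
# Install the Git LFS hooks for this repository (once per machine/repo)
git lfs install

# Track the large file (or a pattern); this records the rule in .gitattributes
git lfs track "data/measurements.txt"

# Commit .gitattributes and the data file as usual, then push
git add .gitattributes data/measurements.txt
git commit -m "Add large data file via Git LFS"
git push origin master
```

Git then versions only a small pointer file, while the large content itself is stored on the LFS server.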

Someone has made a nice comparison of git-annex vs. git lfs here, and this post compares several methods in short form.

They both seem great. git-annex is currently more mature, but git lfs is developed by GitHub, which is what I use, so that is why I went with git lfs.

Upvotes: 7

Sérgio

Reputation: 7279

Is pages.github.com the correct place? No.

GitHub's help pages answer this question very clearly (I was looking for this as well):

https://help.github.com/articles/what-is-my-disk-quota

Large media files

Binary media files do not get along very well with Git. For these files it's usually best to use a service specifically designed for what you're using.

For large media files like video and music you should host the files yourself or use a service like Vimeo or YouTube.

For design files like PSDs and 3D models, a service like Dropbox usually works quite nicely. This is what GitHub's designers use to stay in sync; only final image assets are committed into our repos.

and https://help.github.com//articles/distributing-large-binaries

Upvotes: 2

Rob Kielty

Reputation: 8172

If the file does not need to be under version control, then I would be reluctant to place it on GitHub.

Update based on discussions ...

From http://git-scm.com/book/en/Customizing-Git-Git-Hooks

After you run a successful git checkout, the post-checkout hook runs; you can use it to set up your working directory properly for your project environment. This may mean moving in large binary files that you don’t want source controlled, auto-generating documentation, or something along those lines.

So, using this mechanism, you could download the externally stored data file into your working copy, for example with a hook like the sketch below.
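
A minimal sketch of such a post-checkout hook, assuming the data file is published at a hypothetical URL and should land at data/large_dataset.txt (both placeholders):

```
#!/bin/sh
# .git/hooks/post-checkout  (make it executable: chmod +x .git/hooks/post-checkout)
# Fetch the externally hosted data file if it is not already in the working copy.
# DATA_FILE and DATA_URL are hypothetical placeholders.
DATA_FILE="data/large_dataset.txt"
DATA_URL="https://example.com/downloads/large_dataset.txt"

if [ ! -f "$DATA_FILE" ]; then
    mkdir -p "$(dirname "$DATA_FILE")"
    curl -L -o "$DATA_FILE" "$DATA_URL"
fi
```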

Upvotes: 3

Ali

Reputation: 19722

Put it in the repo if:
1- you want to keep track of the changes
2- it is actually a part of the project and you want people to receive it when they clone the repo

Don't put it in the repo (use .gitignore to exclude it; see the sketch after this list) if:
1- it changes often but the changes are not meaningful and you don't want to keep the history
2- it is available online or you can make it available online and put a link or something in the repo for people to know where to find it
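
If you go the .gitignore route, a minimal sketch (the path is a hypothetical placeholder for wherever your data file lives):

```
# .gitignore — keep the large data file out of version control
data/large_dataset.txt
```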

Dropbox is fine if you don't have lots of people downloading it; Amazon S3 is your best bet for hosting it.

Upvotes: 7

Adam Dymitruk

Reputation: 129762

You can put it on GitHub, but I would recommend putting it in another repository and linking to it via a submodule. This ensures that the file does not get transferred or touched unless you explicitly do so via the submodule commands.
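
A minimal sketch of that setup, assuming a hypothetical data-only repository:

```
# In the main project, add the data-only repository as a submodule under data/
git submodule add https://github.com/yourname/project-data.git data
git commit -m "Add data repository as a submodule"

# People who clone the main repo only fetch the data when they ask for it:
git submodule update --init data
```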

Upvotes: 3
