glaserl

Reputation: 92

When is a single file considered too large for git?

Summary

I have a git repository to track my courses at uni. Some lecture slides in .pdf are rather large (20-30 MB), which made me wonder when the usual wisdom "don't put large files in git!" starts to apply.

I give my situation as an example, but I am really interested in the general limits on file size and frequency of change that one should take into account.

Example Case

In that repository I have a directory for each course I'm taking, each directory containing code for assignments and projects. I also want to have the slides of each course in there for easy syncing.

As far as I know, GitHub blocks files >1GB. However, the git repo I am using is hosted on a private 1 TB machine I share with a friend, so I guess other limits apply?

In general, I would never add databases >100 MB to git, but does this rule apply to 20-50 MB files (lecture slides) that will change at most once, if ever?

Upvotes: 3

Views: 1532

Answers (1)

Edward Thomson

Reputation: 78653

Let's assume for a moment that you want to keep all of these files within one tree and that you want to use git to manage them for whatever reason (because it's simpler for you, the tools are ubiquitous in your environment, etc).

The typical advice when people talk about large files is to point them to Git Large File Storage (LFS). Git LFS works by letting you mark these large files; it keeps them out of the repository itself and puts them in a separate LFS storage location. When you clone the repository, you get metadata about the files, enough information that when you check out a branch, git-lfs can download those large files from the LFS storage area and put them on disk.

This is helpful because you don't need to fetch all that data: the old versions of large files, or the large files that only exist on other branches. You only download exactly what you need to check out HEAD.
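If you did decide to go the LFS route, the setup is short. A minimal sketch, assuming git-lfs is already installed and using *.pdf as an example pattern (the file path is just illustrative):

    # Enable the LFS hooks (one-time per user/clone)
    git lfs install

    # Mark PDFs as LFS-managed; this records a rule in .gitattributes
    git lfs track "*.pdf"

    # Commit the attributes file and the slides as usual
    git add .gitattributes lectures/week01-slides.pdf
    git commit -m "Track lecture slides with Git LFS"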

Let's compare Git LFS to "pure" git in a few areas:

Downloads

In your scenario, you're not modifying these files. You have a single revision, and you want it checked out, always. Thus the approximate bandwidth and time used by git-lfs and by regular git are... the same.

(This assumes that these files don't compress well, or share much in common, which is a pretty good guess. But if it's a poor guess then git might end up being more efficient than Git LFS based on the way it sends data.)
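If you want to test that guess, a quick check is to see how much a slide deck shrinks under general-purpose compression; PDFs are usually already compressed internally and only lose a few percent. A rough sketch (the file name is just an example):

    # Compare the raw size with the gzip-compressed size
    ls -l lecture01.pdf
    gzip -9 -c lecture01.pdf > /tmp/lecture01.pdf.gz
    ls -l /tmp/lecture01.pdf.gz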

On-disk Storage

With either solution, obviously you'll need enough disk space to store the checked out version of the file in your working directory. However, with regular git, you'll also need to store a copy as a git "object" in the git repository.

This comes from git's nature as a distributed version control system: when you clone a repository, you get a copy of every version of every file that has ever existed in the repository.

As a result, if you check in a 10 GB file, you'll need 20 GB: 10 GB to store it within the working directory where you can access it, and another 10 GB to store it as an object in the Git repository. (This, again, assumes that the contents don't compress well.)
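You can see that doubling on an existing clone by comparing the size of git's object database with the size of the working copy (a rough check; du options vary slightly between platforms):

    # Size of git's object database, where the second copy lives
    git count-objects -vH

    # Size of the .git directory versus the whole working copy
    du -sh .git
    du -sh .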

Hosting

As you noted, some hosting providers put limits on the size of your repository. Since you're hosting this on your own server, you only need to ensure that you have enough disk space, and enough bandwidth to clone.

So in your scenario, as long as you have adequate disk space, roughly twice the size of the contents of the current working directory, git (without Git LFS) is an excellent choice.

Upvotes: 2
