Pragy Agarwal

Reputation: 579

Efficiently committing small changes to a large file in Git?

Say that you have a 100MB text file, and you wish to commit changes to this file periodically to git. The changes are small and frequent.

Is there any efficient way of handling this with Git?

The normal way of staging and committing the file will cause git to read & write the entire file again, irrespective of how small your change is.

Is there a way of making a commit using only a "diff" of the changes?

Upvotes: 1

Views: 783

Answers (3)

LeGEC

Reputation: 51840

Git will indeed read the entire content of the file, for example to compute its hash or to diff the file against another version.

For storage, however, Git already has a "diff" storage format: pack files. You can explicitly ask Git to pack objects by running git gc.

If you need performance:

  • use a program that computes the diff, and store only the diffs in Git, or
  • consider that Git may not be the appropriate tool for your use case.
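
To illustrate the first option, here is a minimal sketch using Python's standard difflib module. The helper name make_patch, the file name big.txt and the sample data are made up for the example. For a real 100MB file you would more likely use a dedicated diff tool, since difflib is not designed for inputs of that size.

```python
import difflib


def make_patch(old_text: str, new_text: str, name: str = "big.txt") -> str:
    """Return a unified diff that describes how old_text becomes new_text."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=f"a/{name}", tofile=f"b/{name}"
        )
    )


# Hypothetical data: 2,000 lines, exactly one of which is edited.
old = "".join(f"line {i}\n" for i in range(2_000))
new = old.replace("line 1000\n", "line 1000 (edited)\n", 1)

patch = make_patch(old, new)

# The patch is what you would commit instead of the full file.
print(f"full file: {len(old)} bytes, patch: {len(patch)} bytes")
```

Each commit would then contain one small patch file, at the cost of having to replay the patches to reconstruct the current version.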

Upvotes: 1

torek

Reputation: 488103

Is there any efficient way of handling this with Git?

No.

The hash ID of any Git object is a cryptographic checksum of its contents. You could speed up the computation a bit by having saved checksums for the first N megabytes, for instance, so that if you change some bytes 50 MB into the 100 MB object, you can compute the new blob object checksum by starting with the known 50 MB checksum and hence computing only about half as much of a checksum. But you'll still need to either store the entire loose object or implement your own pack-file algorithm as well.
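
As a rough sketch of that idea, assuming a repository that uses SHA-1 object IDs: a blob's ID is the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the content. Python's hashlib lets you copy a hash state, so you can save the state after the first half and resume from there. The function names below are made up for the example. Note that the header contains the size, so the saved state is only reusable for edits that keep the file length unchanged.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Blob ID as computed for SHA-1 repositories: sha1(b'blob <size>\\0' + content)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def prefix_state(content: bytes, prefix_len: int):
    """Hash the header and the first prefix_len bytes; return the reusable state."""
    state = hashlib.sha1(b"blob %d\x00" % len(content))
    state.update(content[:prefix_len])
    return state


def resume(state, tail: bytes) -> str:
    """Finish the checksum from a saved state without touching the saved copy."""
    h = state.copy()
    h.update(tail)
    return h.hexdigest()


# Hypothetical 1 MB file; save the hash state for its first half.
data = bytearray(b"x" * 1_000_000)
half = len(data) // 2
saved = prefix_state(bytes(data), half)

# Same-length edit in the second half: only the tail needs rehashing.
data[half + 10] = ord("y")
new_id = resume(saved, bytes(data[half:]))

assert new_id == git_blob_id(bytes(data))
```

This only shortens the checksum step. As noted above, the full object still has to be written somewhere.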

Git is much better at handling a larger number of smaller files. For instance, instead of one 100 MB file, you could store 1000 files of 100 kB each. If you need to modify some bytes in the middle, you're then changing only a single file, or at most two files, each of which is smaller and will become a smaller loose object that can be checksummed relatively quickly.
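
A minimal sketch of the splitting idea, assuming fixed-size chunks (the names CHUNK_SIZE, split_chunks and changed_chunks are invented for the example). It only reports which chunks differ between two versions; writing each chunk to its own file is left out. One caveat of fixed-size chunks: an edit that inserts or deletes bytes shifts every later chunk, so this works best for same-length edits or for data split along natural boundaries.

```python
from typing import List

CHUNK_SIZE = 100 * 1024  # 100 kB per chunk, as in the example figure above


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> List[bytes]:
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def changed_chunks(old: bytes, new: bytes, size: int = CHUNK_SIZE) -> List[int]:
    """Return the indices of chunks that differ between old and new."""
    old_c, new_c = split_chunks(old, size), split_chunks(new, size)
    count = max(len(old_c), len(new_c))
    return [
        i for i in range(count)
        if i >= len(old_c) or i >= len(new_c) or old_c[i] != new_c[i]
    ]


# Hypothetical 10-chunk file with a 4-byte edit straddling a chunk boundary.
old = bytes(10 * CHUNK_SIZE)
edited = bytearray(old)
pos = 3 * CHUNK_SIZE - 2
edited[pos:pos + 4] = b"EDIT"
new = bytes(edited)

changed = changed_chunks(old, new)
print(f"{len(changed)} of {len(split_chunks(new))} chunk files would change")
```

Only the chunk files listed in changed would need to be staged and committed.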

Upvotes: 3

Stanislav Bashkyrtsev

Reputation: 15308

There are two formats of Git objects: loose and packed. When you initially add and commit a file, Git writes another loose object, which is a full blob. But Git can also turn this into a packed object (e.g. when pushing), which stores the diff. See the answers to: What are the "loose objects" that the Git GUI refers to?
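
To make "a full blob" concrete, here is a small sketch of how a loose blob is laid out, assuming a SHA-1 repository: the header and the complete content are zlib-compressed and stored under a path derived from the object ID. The function name loose_object and the sample data are made up, and the exact compressed bytes Git writes may differ from what zlib.compress produces here.

```python
import hashlib
import zlib


def loose_object(content: bytes):
    """Return (path, compressed bytes) for a blob stored as a loose object."""
    raw = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(raw).hexdigest()
    path = f".git/objects/{oid[:2]}/{oid[2:]}"
    return path, zlib.compress(raw)


# Two hypothetical versions of a file that differ in a single byte.
v1 = b"some line of text\n" * 50_000
v2 = v1[:-1] + b"!"

path1, stored1 = loose_object(v1)
path2, stored2 = loose_object(v2)

# Each version gets its own object, and each object holds the whole file.
print(path1)
print(path2)
print(len(zlib.decompress(stored1)), len(zlib.decompress(stored2)))
```

So until the objects are packed, every committed version of the file costs one full compressed copy.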

After committing the file you can run git gc so that Git packs the objects and removes the old loose object. I am not sure whether it removes the old one right away or only after some time.

Upvotes: 2
