ThoughtProcess

Reputation: 449

Will git store diffs of binary files that change in content, but never change size?

I am interested in storing an EEPROM HEX file of fixed size in git. The files will NEVER change size, but they will change content frequently.

If I add an EEPROM file to git and commit it, then I change a few bytes in the file, will git store this change efficiently over dozens or hundreds of commits?

In my research on this issue, I've run across some thorough discussions on the topic, but most of them seem to deal with files like PDFs and MP3s which nobody expects to stay the same or be comparable in a diff. I wonder if EEPROM HEX files would be treated differently since the file size stays the same?

EDITED (again)

Some initial observations... (Kudos to Krumelur for the "just try it" encouragement!)

The file that I am testing is a 7MB Intel HEX file. Based on the output from git, it appears to treat this file as a text file:

$ git commit -m "Changed a single byte."
[master bc2958b] Changed a single byte.
1 file changed, 1 insertion(+), 1 deletion(-)

The diff output matches as well:

$ git show bc2958b
commit bc2958b[...]
Author: ThoughtProcess <[email protected]>
Date:   Wed Jul 31 11:53:41 2013 -0500

    Changed a single byte.

diff --git a/test.hex b/test.hex
index fbdeed4..04d19b6 100644
--- a/test.hex
+++ b/test.hex
@@ -58,7 +58,7 @@
 :20470000000000000000000000000000000000000000000000000000E001EDD0D9310D00E4
 :20472000400200000080000000000000000000000000000000000000E002EDD0CF310D000B
 :20474000400200000080000000000000000000000000000000000000E0036D0063040D00D3
-:2047600040020000008000000000000000000000000000000000000000A0FF2F06801B0FF9
+:2047600040020000008000000000000000000000000000000000000000A0FF2G06801B0FF9
 :2047800000E01D007A00820F3CFB000000000000000000000000000000A0FF8F06801B1FEC
 :2047A00000E01D006A00821F3CFB000000000000000000000000000000A0FF6F06801B8F7C
 :2047C00000E01D005A00821F3CFB000000000000000000000000000000A0FF8F06801BDFFC

After 7 commits, the repository size is now 21MB. Here's the strange thing: the repository seems to grow roughly linearly (about 2MB) with each commit. Is that simply how git is designed to work? Or is it not storing the incremental differences as text like I'd expect?

Upvotes: 5

Views: 2024

Answers (3)

qwr

Reputation: 10891

We can test if git efficiently stores two very similar binaries. Testing on git version 2.9.2.windows.1 (extra output removed for clarity):

$ git init
$ du -bs .git
15243   .git
$ head -c 10MB < /dev/urandom > random.bin
$ git add random.bin
$ git commit -m "Add random.bin"
$ du -bs .git
10018971        .git
$ git gc
$ du -bs .git
10020319        .git

Git stores the 10 MB binary file with about 20 KB of overhead (note that the original file still takes up another 10 MB in the working directory). Now modify a few bytes of the file with a text editor (or from the command line, as described in "Write byte at address (hexedit/modify binary from the command line)", if you prefer):

$ vim random.bin  # modify a few bytes
$ git add random.bin
$ git commit -m "Modify random.bin a little"
$ du -bs .git
20023953        .git
$ git gc
$ du -bs .git
10021228        .git

Before git gc, both versions were stored entirely. Afterwards, git packs the two files very efficiently. Git packfiles are described in much more detail at https://codewords.recurse.com/issues/three/unpacking-git-packfiles and https://git-scm.com/docs/pack-format

$ git verify-pack -v .git/objects/pack/pack-4bc29bb6848c64b94ba6074939c851b83240dd60.pack
4ea81b3f5d4f0ef5ddbc8e9adaac73b60c0899c4 commit 201 151 12
9e2bafb8cd3a4f0fc6d0773611a92ac1b14303b0 commit 141 111 163
f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd blob   10000008 10003073 274
0b650d78653ec22c19453264384ed644fc956f42 tree   38 49 10003347
bd143b12cdec07b9aa68875052c01ae6d041f28f tree   38 49 10003396
fd1a966f4b0acc4c77ab85cb81841ebb0ee290ea blob   470 309 10003445 1 f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-4bc29bb6848c64b94ba6074939c851b83240dd60.pack: ok

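The last blob is deltified: its entry ends with a chain depth of 1 and the SHA-1 of its base object, which is the other blob, the one stored in full. Only 309 bytes of the pack are spent on the second version.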
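For readers unfamiliar with the columns: the git verify-pack documentation describes each object line as SHA-1, type, size, size in the packfile, and offset, with deltified objects adding a depth and a base SHA-1. A minimal Python sketch, assuming that layout, picks the deltified entries out of such a listing. PackEntry and parse_entry are hypothetical helper names, not part of git, and the sample lines are copied from the output above.

```python
from typing import NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int
    packed_size: int
    offset: int
    depth: Optional[int] = None  # only present for deltified objects
    base: Optional[str] = None   # SHA-1 of the delta base, if any


def parse_entry(line: str) -> Optional[PackEntry]:
    """Parse one object line of `git verify-pack -v`; return None for other lines."""
    fields = line.split()
    if len(fields) not in (5, 7) or len(fields[0]) != 40:
        return None  # summary lines such as "non delta: 5 objects"
    sha, kind = fields[0], fields[1]
    size, packed_size, offset = (int(f) for f in fields[2:5])
    if len(fields) == 7:
        return PackEntry(sha, kind, size, packed_size, offset, int(fields[5]), fields[6])
    return PackEntry(sha, kind, size, packed_size, offset)


listing = """\
f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd blob   10000008 10003073 274
fd1a966f4b0acc4c77ab85cb81841ebb0ee290ea blob   470 309 10003445 1 f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd
non delta: 5 objects
"""

entries = [e for e in map(parse_entry, listing.splitlines()) if e is not None]
deltas = [e for e in entries if e.base is not None]
for e in deltas:
    print(f"{e.sha[:7]} is a delta of {e.base[:7]}, {e.packed_size} bytes in the pack")
```

On the two sample lines this reports fd1a966 as a delta of f2aa8f2 occupying 309 bytes.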

A similar test is done in this answer.

Upvotes: 4

Sampo Smolander

Reputation: 1736

git is actually storing a new full (zlib-compressed) copy of your file(s) somewhere under .git/objects on every commit, so your repository does indeed grow linearly. You can run git gc to make git pack the repository. In the case of your data, git should be able to pack really efficiently, and your repository should get much smaller. (git will also automatically run git gc occasionally.)
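The linear growth follows from how loose objects are built: each version becomes its own zlib-compressed blob, named by the SHA-1 of a short header plus the full content. A rough Python sketch of that idea follows. The data is a small made-up stand-in for a HEX file, loose_object is a hypothetical helper, and git's actual compression level may differ, so the sizes are only indicative.

```python
import hashlib
import random
import zlib
from typing import Tuple


def loose_object(content: bytes) -> Tuple[str, int]:
    """Return (object id, compressed size) for content stored as a loose blob."""
    store = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(store).hexdigest(), len(zlib.compress(store))


# Made-up stand-in for a HEX file: 2000 text lines of 64 hex digits each.
rng = random.Random(0)
lines = ["".join(rng.choice("0123456789ABCDEF") for _ in range(64)) for _ in range(2000)]
version1 = "\n".join(lines).encode()

# Second version: change one character on one line, keeping the size identical.
lines[1000] = ("0" if lines[1000][0] != "0" else "1") + lines[1000][1:]
version2 = "\n".join(lines).encode()

id1, size1 = loose_object(version1)
id2, size2 = loose_object(version2)
print(len(version1) == len(version2))  # same file size
print(id1 != id2)                      # but a different object id
print(size1, size2)                    # each version is stored in full
```

A one-character change yields a new object id and a second, separately compressed copy, which is why the repository grows by roughly the compressed file size per commit until git gc packs it.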

Upvotes: 7

Carl Norum

Reputation: 224864

If you're really storing Intel HEX format files, you don't have anything to worry about - they are text files. They just happen to represent binary data.

From the Wikipedia entry:

The format is a text file, with each line containing hexadecimal values encoding a sequence of data and their starting offset or absolute address.

Editorial note: The change you made in your test isn't valid - G is not a hexadecimal digit, and besides that, you didn't update the checksum.
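For reference, the Intel HEX checksum is the two's complement of the low byte of the sum of all preceding bytes in the record. A small Python sketch that checks the record from the question's diff follows. check_record and record_checksum are hypothetical helper names, and the 2E edit is just an example of a valid single-byte change, not something from the question.

```python
def record_checksum(payload: bytes) -> int:
    """Two's complement of the low byte of the sum of the bytes before the checksum."""
    return -sum(payload) & 0xFF


def check_record(line: str) -> bool:
    """True if the line is a well-formed Intel HEX record with a matching checksum."""
    if not line.startswith(":"):
        return False
    try:
        raw = bytes.fromhex(line[1:].strip())
    except ValueError:
        return False  # non-hex characters such as 'G' land here
    if len(raw) < 5 or raw[0] != len(raw) - 5:
        return False  # byte-count field disagrees with the data length
    return record_checksum(raw[:-1]) == raw[-1]


# Record from the diff: byte count, address, type, 32 data bytes, checksum.
head = ":20476000" + "400200000080" + "00" * 19 + "A0FF"
original = head + "2F" + "06801B0F" + "F9"
invalid = head + "2G" + "06801B0F" + "F9"

print(check_record(original))  # the unmodified line
print(check_record(invalid))   # the line with 'G' in it

# A valid edit changes the data byte and recomputes the checksum.
body = bytes.fromhex((head + "2E" + "06801B0F")[1:])
fixed = head + "2E" + "06801B0F" + "%02X" % record_checksum(body)
print(check_record(fixed))
```

The unmodified record passes, the one containing G is rejected before the checksum is even considered, and the recomputed record passes again.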

Upvotes: 1
