Reputation: 449
I am interested in storing an EEPROM HEX file of fixed size in git. The files will NEVER change size, but they will change content frequently.
If I add an EEPROM file to git and commit it, then I change a few bytes in the file, will git store this change efficiently over dozens or hundreds of commits?
In my research on this issue, I've run across some thorough discussions on the topic, but most of them seem to deal with files like PDFs and MP3s which nobody expects to stay the same or be comparable in a diff. I wonder if EEPROM HEX files would be treated differently since the file size stays the same?
EDITED (again)
Some initial observations... (Kudos to Krumelur for the "just try it" encouragement!)
The file that I am testing is a 7MB Intel HEX file. Based on the output from git, it appears to treat this file as a text file:
$ git commit -m "Changed a single byte."
[master bc2958b] Changed a single byte.
1 file changed, 1 insertion(+), 1 deletion(-)
The diff output matches as well:
$ git show bc2958b
commit bc2958b[...]
Author: ThoughtProcess <[email protected]>
Date: Wed Jul 31 11:53:41 2013 -0500
Changed a single byte.
diff --git a/test.hex b/test.hex
index fbdeed4..04d19b6 100644
--- a/test.hex
+++ b/test.hex
@@ -58,7 +58,7 @@
:20470000000000000000000000000000000000000000000000000000E001EDD0D9310D00E4
:20472000400200000080000000000000000000000000000000000000E002EDD0CF310D000B
:20474000400200000080000000000000000000000000000000000000E0036D0063040D00D3
-:2047600040020000008000000000000000000000000000000000000000A0FF2F06801B0FF9
+:2047600040020000008000000000000000000000000000000000000000A0FF2G06801B0FF9
:2047800000E01D007A00820F3CFB000000000000000000000000000000A0FF8F06801B1FEC
:2047A00000E01D006A00821F3CFB000000000000000000000000000000A0FF6F06801B8F7C
:2047C00000E01D005A00821F3CFB000000000000000000000000000000A0FF8F06801BDFFC
After 7 commits, the repository size is now 21MB. Here's the strange thing: the repository seems to grow roughly linearly, by about 2MB with each commit. Is that simply how git is designed to work? Or is it not storing the incremental differences as text like I'd expect?
Upvotes: 5
Views: 2024
Reputation: 10891
We can test if git efficiently stores two very similar binaries. Testing on git version 2.9.2.windows.1 (extra output removed for clarity):
$ git init
$ du -bs .git
15243 .git
$ head -c 10MB < /dev/urandom > random.bin
$ git add random.bin
$ git commit -m "Add random.bin"
$ du -bs .git
10018971 .git
$ git gc
$ du -bs .git
10020319 .git
Git stores the 10 MB binary file with about 20 KB of overhead (note that the original file still takes up another 10 MB in the working directory). Now we modify a few bytes of the file with a text editor (or with a hex editor or a command-line tool, if you prefer):
$ vim random.bin # modify a few bytes
$ git add random.bin
$ git commit -m "Modify random.bin a little"
$ du -bs .git
20023953 .git
$ git gc
$ du -bs .git
10021228 .git
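The `vim` step in the transcript above can also be scripted. The following is a minimal sketch, not part of the original test: `patch_bytes` and the demo file name are hypothetical, and it overwrites bytes in place so that the file size stays fixed.

```python
import os
import tempfile


def patch_bytes(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at offset without changing the file size."""
    size = os.path.getsize(path)
    if offset < 0 or offset + len(new_bytes) > size:
        raise ValueError("patch would fall outside the existing file")
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(new_bytes)


# Demo on a small throwaway file rather than the 10 MB random.bin.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.bin")  # hypothetical file name
    with open(demo, "wb") as f:
        f.write(os.urandom(1024))
    patch_bytes(demo, 100, b"\x00\x01\x02")
    print("size after patch:", os.path.getsize(demo))
```

Writing in place avoids any risk of an editor appending or converting bytes when it saves the file.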
Before git gc, both versions were stored in full. Afterwards, git packs the two files very efficiently. Git packfiles are described in much more detail at https://codewords.recurse.com/issues/three/unpacking-git-packfiles and https://git-scm.com/docs/pack-format
$ git verify-pack -v .git/objects/pack/pack-4bc29bb6848c64b94ba6074939c851b83240dd60.pack
4ea81b3f5d4f0ef5ddbc8e9adaac73b60c0899c4 commit 201 151 12
9e2bafb8cd3a4f0fc6d0773611a92ac1b14303b0 commit 141 111 163
f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd blob 10000008 10003073 274
0b650d78653ec22c19453264384ed644fc956f42 tree 38 49 10003347
bd143b12cdec07b9aa68875052c01ae6d041f28f tree 38 49 10003396
fd1a966f4b0acc4c77ab85cb81841ebb0ee290ea blob 470 309 10003445 1 f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd
non delta: 5 objects
chain length = 1: 1 object
.git/objects/pack/pack-4bc29bb6848c64b94ba6074939c851b83240dd60.pack: ok
The last blob is deltified: it is stored as a 470-byte delta (309 bytes in the pack) and it references the SHA-1 of the original binary.
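To pick the deltified objects out of a longer listing, the verify-pack output can be parsed. This is a rough sketch that assumes the column layout shown above (SHA-1, type, size, size in pack, offset, and for deltas a depth and base SHA-1); `PackEntry` and `parse_verify_pack` are names made up for this example.

```python
from typing import List, NamedTuple, Optional


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int
    size_in_pack: int
    offset: int
    depth: int            # 0 for objects stored without a delta
    base: Optional[str]   # SHA-1 of the delta base, if any


def parse_verify_pack(text: str) -> List[PackEntry]:
    """Parse object lines of `git verify-pack -v`, skipping summary lines."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) not in (5, 7) or len(parts[0]) != 40:
            continue  # e.g. "non delta: 5 objects" or the final "ok" line
        try:
            size, in_pack, offset = (int(p) for p in parts[2:5])
            depth = int(parts[5]) if len(parts) == 7 else 0
        except ValueError:
            continue
        base = parts[6] if len(parts) == 7 else None
        entries.append(PackEntry(parts[0], parts[1], size, in_pack, offset, depth, base))
    return entries


# The object lines from the listing above.
SAMPLE = """\
4ea81b3f5d4f0ef5ddbc8e9adaac73b60c0899c4 commit 201 151 12
9e2bafb8cd3a4f0fc6d0773611a92ac1b14303b0 commit 141 111 163
f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd blob 10000008 10003073 274
0b650d78653ec22c19453264384ed644fc956f42 tree 38 49 10003347
bd143b12cdec07b9aa68875052c01ae6d041f28f tree 38 49 10003396
fd1a966f4b0acc4c77ab85cb81841ebb0ee290ea blob 470 309 10003445 1 f2aa8f26c4dcad0f73a03c958b2eb1c0fc6cb8fd
non delta: 5 objects
chain length = 1: 1 object
"""

for e in parse_verify_pack(SAMPLE):
    if e.base is not None:
        print(e.sha[:7], "is a delta against", e.base[:7], "using", e.size_in_pack, "bytes in the pack")
```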
A similar test is done in this answer.
Upvotes: 4
Reputation: 1736
git is actually storing a new full copy of your file(s) somewhere under .git/objects, so your repository does indeed grow linearly. You can run git gc to make git pack the repository. In the case of your data, git should be able to pack really efficiently, and your repository should get much smaller. (git will also automatically run git gc occasionally.)
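As a rough illustration of why each commit adds a full copy: a loose blob is the file content prefixed with a short header, named by its SHA-1 and compressed with zlib. The sketch below mimics that in plain Python without calling git; `loose_blob` is a hypothetical helper, and the compressed size it reports may differ slightly from what git writes because the compression level is an assumption.

```python
import hashlib
import os
import zlib


def loose_blob(content: bytes):
    """Return (object id, compressed bytes) roughly as git would store a loose blob."""
    store = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


# Two versions of a hex-like text file that differ in a single character.
v1 = os.urandom(4096).hex().encode()
v2 = bytearray(v1)
v2[100] = ord("1") if v2[100] == ord("0") else ord("0")
v2 = bytes(v2)

id1, z1 = loose_blob(v1)
id2, z2 = loose_blob(v2)
print(id1, len(z1))
print(id2, len(z2))  # a different id and another full-size compressed copy
```

A one-character change yields a new object id and a second compressed copy of the whole file, so growth stays linear until git gc repacks the objects as deltas. Hex text compresses well under zlib, which may be why a 7MB file added only about 2MB per commit in the question; that is an inference, not something measured here.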
Upvotes: 7
Reputation: 224864
If you're really storing Intel HEX format files, you don't have anything to worry about - they are text files. They just happen to represent binary data.
From the Wikipedia entry:
The format is a text file, with each line containing hexadecimal values encoding a sequence of data and their starting offset or absolute address.
Editorial note: The change you made in your test isn't valid - G is not a hexadecimal digit, and besides that, you didn't update the checksum.
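For reference, a small sketch of how a record could be validated. The last byte of an Intel HEX record is the two's complement of the low byte of the sum of all preceding bytes in the record; `record_checksum` and `check_record` are illustrative names, not part of any library.

```python
def record_checksum(body: bytes) -> int:
    """Two's complement of the low byte of the sum of the record bytes."""
    return (-sum(body)) & 0xFF


def check_record(line: str) -> bool:
    """Return True if the line is a well-formed Intel HEX record with a valid checksum."""
    line = line.strip()
    if not line.startswith(":"):
        return False
    try:
        raw = bytes.fromhex(line[1:])
    except ValueError:
        return False  # non-hex character (such as G) or odd number of digits
    # Layout: count (1) + address (2) + type (1) + data (count) + checksum (1)
    if len(raw) < 5 or len(raw) != raw[0] + 5:
        return False
    return record_checksum(raw[:-1]) == raw[-1]


# The two versions of the changed record from the diff in the question.
before = ":2047600040020000008000000000000000000000000000000000000000A0FF2F06801B0FF9"
after = ":2047600040020000008000000000000000000000000000000000000000A0FF2G06801B0FF9"
print(check_record(before), check_record(after))
```

When a data byte is edited by hand, the new checksum can be obtained by calling `record_checksum` on every byte of the record except the final one.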
Upvotes: 1