Reputation: 2342
Git stores files as blobs, and then uses a SHA-1 checksum as a key to find each specific blob amongst the others, similar to a filename identifying a file.
So how does this dark magic work? That is, how does one start with a text file and end up with a blob? Is a blob created by dereferencing the memory address of the file or something?
Upvotes: 1
Views: 82
Reputation: 488213
There's very little actual magic in Git. The one bit that is pretty magic is (are?) the various Secure Hash Algorithm (SHA) checksum designs, Git's use of these checksums, and how they form a Merkle Tree, but this is more "math magic" than anything else.
I think you're really asking "how does Git come up with the hash ID", and the answer to that one is simple:
Take the size of the file's data in bytes and print it in decimal, for example 123. Put the printed size after the word blob and a space. Append an ASCII NUL character, b'\0' in Python for instance. Hash the prefix and the data, and the result is the blob's hash ID:
$ python3
...
>>> data = b"some file data\n"
>>> prefix = "blob {}\0".format(len(data)).encode("utf-8")
>>> import hashlib
>>> h = hashlib.sha1()
>>> h.update(prefix)
>>> h.update(data)
>>> h.hexdigest()
'a831035a26dd2f75c2dd622c70ee22a10ee74a65'
We can check by using Git's object hasher:
$ echo 'some file data' | git hash-object -t blob --stdin
a831035a26dd2f75c2dd622c70ee22a10ee74a65
The hashes match, so this is the blob hash for any file that consists solely of the 15-byte line "some file data" as terminated by a newline. Note that it is the content that determines the hash ID: the file's name here is irrelevant. (This means the file's name must be, and is, stored elsewhere: in Git, in one or more tree objects.)
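To make the "name is irrelevant" point concrete, here is a minimal sketch that wraps the steps above in a function and applies it to two differently named files with identical contents. The helper name git_blob_hash and the file names are made up for illustration; they are not part of Git, and this covers only the SHA-1 object format discussed here.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the SHA-1 blob hash ID for `data`, following the steps above.

    git_blob_hash is a hypothetical helper name, not a Git API.
    """
    # Header: the word "blob", a space, the size in decimal, then a NUL byte.
    prefix = b"blob " + str(len(data)).encode("ascii") + b"\0"
    # Hash the header followed by the raw file contents.
    return hashlib.sha1(prefix + data).hexdigest()


# Two "files" with different (made-up) names but identical contents.
files = {
    "README.txt": b"some file data\n",
    "copy/of/readme.txt": b"some file data\n",
}
hashes = {name: git_blob_hash(content) for name, content in files.items()}

# The name never enters the hash, so both entries get the same hash ID.
print(hashes["README.txt"] == hashes["copy/of/readme.txt"])  # True
print(git_blob_hash(b""))  # hash ID of an empty blob
```

Only the bytes and their length go into the hash, which is why Git can store one blob for any number of identically filled files and keep the names in tree objects.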
(Note that SHA-1 is no longer considered cryptographically secure. Git is slowly being migrated to other hash algorithms, but there is no rush here. See How does the newly found SHA-1 collision affect Git?)
Upvotes: 5