Edison

Reputation: 4281

Weird Path in Git directory Structure

I have seen the following path inside the .git directory.

.git/objects/3b/12abef878787483abeceddaa5544489abff789a

when in fact the SHA is 3b12abef878787483abeceddaa5544489abff789a

which is the SHA of the contents of the file, and hence should be stored without the /. Why did Git store the blob at this weird path, and what are the advantages of doing so?

Upvotes: 1

Views: 92

Answers (1)

Jonathan Leffler

Reputation: 753495

The reason is to avoid having too many files in one directory. With all the SHA values that start with 3b stored in a sub-directory 3b, the load on any single directory is, on average, 1/256th of what it would be if all the blobs were in a single directory. Ultimately, this improves performance; there is less searching to do to find a particular blob.
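
As a rough illustration of that split, here is a minimal Python sketch that maps an object ID to its path under the objects directory. The helper name `loose_object_path` is hypothetical (it is not part of Git), and the ID is the one quoted in the question.

```python
import posixpath


def loose_object_path(object_id: str, git_dir: str = ".git") -> str:
    """Build the object path: the first two hex digits name the
    sub-directory, the remaining digits name the file inside it.

    Hypothetical helper for illustration only; not part of Git itself.
    """
    fanout_dir, file_name = object_id[:2], object_id[2:]
    return posixpath.join(git_dir, "objects", fanout_dir, file_name)


# Object ID taken verbatim from the question.
print(loose_object_path("3b12abef878787483abeceddaa5544489abff789a"))
# .git/objects/3b/12abef878787483abeceddaa5544489abff789a
```

Because there are only 256 possible two-digit hex prefixes, the top-level objects directory never needs more than 256 of these sub-directories, however many objects exist.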

You can see similar effects in the terminfo directory, where the entries are sub-divided into directories based on the first letter of the terminal entry. The CPAN system has authors/id/A/AA/AARDVARK in its naming hierarchy.


Please elaborate a little bit for me.

Suppose that git wants to find the blob 3b12abef878787483abeceddaa5544489abff789a and the directory partitioning scheme is not in use. There might be, for sake of argument, 512 blobs, and to get to the file, the kernel might have to read all 512 directory entries in .git/objects to find the right entry.

Now suppose that the directory partitioning scheme is in use, and that by a miracle of statistical mischance, there are 256 subdirectories each containing 2 files. Now the kernel at worst has to read 256 directory entries with 2-byte names (compared with 512 directory entries with 40-byte names, since a SHA-1 is 40 hex digits) in the .git/objects directory, and then has to read at worst 2 entries with 38-byte names in the .git/objects/3b directory.
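
The arithmetic for this idealized case can be sketched in Python. It assumes, as the example does, that the kernel scans directory entries linearly; the function names are made up for illustration.

```python
def worst_case_entries_flat(total_objects: int) -> int:
    # One big directory: a linear scan may have to look at every entry.
    return total_objects


def worst_case_entries_fanout(subdirs: int, objects_per_subdir: int) -> int:
    # First scan the top-level directory for the sub-directory name,
    # then scan that one sub-directory for the object file.
    return subdirs + objects_per_subdir


print(worst_case_entries_flat(512))       # 512
print(worst_case_entries_fanout(256, 2))  # 258
```

With only 512 objects the saving is modest (258 entries versus 512), but the flat count grows with every object added, while the top-level part of the partitioned count is capped at 256.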

There are complicating factors, such as imperfectly balanced hashing, memory caching, and disk accesses, but the general idea is that distributing the files into multiple subdirectories means that the OS kernel has less work to do to find a file. If the number of files in a directory will extend into the multiple hundreds, it is worth considering breaking it down.
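
To get a feel for the "imperfectly balanced hashing" caveat, the following Python sketch hashes 512 synthetic strings with SHA-1 and counts how many land under each two-hex-digit prefix. The inputs are stand-ins invented for this example, not real Git objects, and `bucket_sizes` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def bucket_sizes(num_objects: int) -> Counter:
    """Count how many synthetic object IDs fall under each two-hex-digit prefix."""
    sizes = Counter()
    for i in range(num_objects):
        # Synthetic stand-in content; real IDs come from Git's own hashing.
        object_id = hashlib.sha1(f"object {i}".encode()).hexdigest()
        sizes[object_id[:2]] += 1
    return sizes


sizes = bucket_sizes(512)
print("sub-directories used:", len(sizes))
print("largest sub-directory:", max(sizes.values()))
print("smallest non-empty sub-directory:", min(sizes.values()))
```

A perfectly even spread would put exactly 2 objects in each of 256 sub-directories; in practice some prefixes collect more and some fewer, which is why the 256-by-2 case above is described as a statistical miracle.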

Upvotes: 7
