MikeJansen

Reputation: 3476

Are git pack files actually necessary on disk?

From my understanding, git's SHA-1 hashing had the side effect of reducing disk storage by not duplicating identical objects, and zlib compression was introduced to explicitly reduce the disk storage of repositories. Later, pack files were added, which introduced deltas to reduce size further and also grouped multiple objects into a single file in order to improve network transmission.
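For reference, the split between loose objects and packs is easy to see in any repository:

    # Summary of loose objects vs. objects stored in packs
    git count-objects -v
    #   count / size      -> number of loose object files and their disk usage (KiB)
    #   in-pack / packs   -> objects stored in pack files, and how many packs exist
    #   size-pack         -> disk usage of those packs (KiB)

    # The files themselves: one file per loose object, plus .pack/.idx pairs
    ls .git/objects/pack/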

I get that the introduction of deltas reduces the size further and that grouping objects together in a single file might have some network improvements.

But is grouping the files together in a pack file actually necessary on disk? I'm not sure what the benefit is, and it seems that it could cause performance issues during garbage collection, since potentially large pack files might have to be re-written when an object is pruned (which I know is somewhat mitigated by putting large files first).

I just don't see the benefit of actually grouping objects into a pack file. Is it to reduce the amount of chatter when negotiating which objects need to be transmitted? If so, it seems that the .idx file could "define" a virtual pack but leave the actual objects as individual files on disk, only "packing" them for transmission.

I mainly want a better understanding of pack files and the reasons for them. I have been working with a co-worker who has a problematic repository, and understanding pack files is helping me help him.

CLARIFICATION: My main question is not "why are pack files useful", it is: What is the benefit of storing the individual objects together in a pack file instead of having the index just point to individual files? What benefit is there? I only see the disadvantage of having to re-write pack files when objects are pruned. I totally get the benefit of deltas.

MORE INFO:

A bit more understanding of how pack files work and why:

  1. Pack files are primarily optimized for network transmission, reducing the total size of the data being transmitted. This seems to be the driving force behind the design decisions.
  2. In order to reconstruct an object, every pack file must be searched until the object ID/hash is found.
  3. The structure of the index files allows a quick binary search, and the index and pack file structures allow quick seeking to get the base and deltas.
  4. Pack files are self-contained, which means a single pack file must contain the base object and any deltas necessary to reconstruct a given object (a rough example follows this list).
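A rough way to see points 2-4 for yourself, assuming at least one pack already exists (e.g. after a git gc):

    # List the pack/index pairs on disk
    ls .git/objects/pack/

    # Dump the contents of the packs via their indexes; for each object this prints
    # SHA-1, type, size, size-in-pack, and offset, and for deltified objects it adds
    # the delta depth and the SHA-1 of the base object it was deltified against.
    git verify-pack -v .git/objects/pack/pack-*.idx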

So what I'm seeing is:

  1. The fewer index files that need to be searched, the faster an object will be found
  2. Having the base and all deltas for related objects in a single OS file improves performance of recreating an object by only opening one file (for the actual data)
  3. Every bit and byte counts for transmission over a network

I'm realizing through all of this that my main underlying concern is the size on disk of the pack files. Extremely large disk files are more difficult to deal with in general -- both from a backup/restore perspective and from a content modification perspective.

The above three points, as I understand them, don't necessitate getting as many objects as possible into a single .pack file. I see the benefit of having as many entries as possible in a .idx file to speed up finding an object, but I have a hunch the pack data could be stored as multiple smaller files and still achieve the network and on-disk performance goals -- even a scheme as simple as one pack file per base object and its delta tree. The existing index scheme could still group these together, and the existing pack structure could be kept for transmission.
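For what it's worth, git already seems to expose some knobs in roughly this direction; I haven't tested how they interact with gc in practice, so treat the values below as placeholders:

    # Ask repack to split the packed data into multiple packs no larger than a given size
    git repack -a -d --max-pack-size=100m

    # Equivalent configuration, honored when repack writes packs to disk
    git config pack.packSizeLimit 100m

    # Build one multi-pack-index covering all packs, so a lookup doesn't have to
    # binary-search every .idx separately
    git multi-pack-index write
    git config core.multiPackIndex true

So the "many smaller packs, one index over them" idea doesn't seem completely off in the weeds.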

Anyway, I think I have answered my initial question with a bit more research, but have revealed what I was actually chewing on in the back of my head, and now it's a bit more into the hypothetical realm.

Upvotes: 3

Views: 1242

Answers (3)

jthill

Reputation: 60393

Files have a constant storage overhead. It's been reduced about as far as is practicable, so to whatever extent it's not negligible it's necessary, and nobody much worries about it; it's generally at least hundreds of bytes per file. Opening a file also has a cost: the metadata has to be read, permissions have to be checked, and a current read position has to be maintained. On the scale of individual objects and what delta compression gets you, either of those is a very heavy penalty to pay, far exceeding any compression benefit for small objects -- and I'm not trying to be exhaustive here or paint a full picture.
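If you want a rough sense of the scale, something like this gives a picture (exact numbers depend heavily on the repository and the filesystem):

    # How many individual files is the object store using, and how much disk?
    find .git/objects -type f | wc -l
    du -sh .git/objects

    # After packing, the same objects typically occupy far fewer files
    # (and usually less space once deltas kick in)
    git gc
    find .git/objects -type f | wc -l
    du -sh .git/objects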

Upvotes: 2

chepner

Reputation: 531738

Without pack files, Git isn't storing deltas at all. If you have a 100 KB file in one commit and then create a new commit that changes a single byte in that file, the new commit also stores the 100 KB file in its entirety; git show simply "renders" the commit as a diff from its parent.

Pack files literally replace the copy of the file with an actual diff, which means a checkout requires reconstructing the file rather than simply copying it from the repository into your working directory.
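A throwaway experiment that makes this concrete (commands only; the path and sizes are arbitrary, and I'm not reproducing any real output):

    # In a scratch repo: commit a largish file, then change one byte and commit again
    git init /tmp/pack-demo && cd /tmp/pack-demo
    head -c 100000 /dev/urandom > big.bin
    git add big.bin && git commit -m "v1"
    printf 'x' | dd of=big.bin bs=1 count=1 conv=notrunc
    git add big.bin && git commit -m "v2"

    # Loose storage: two separate ~100 KB blobs (zlib can't shrink random data much)
    git count-objects -v

    # Packed storage: verify-pack shows one blob stored as a small delta against the
    # other (the extra columns on delta entries are depth and base SHA-1)
    git gc
    git verify-pack -v .git/objects/pack/pack-*.idx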

Upvotes: 1

Acorn

Reputation: 26166

In general, grouping many small files into a single, big one typically increases the compression ratio, because you can usually find shared patterns in them.

It may also help reduce a lot of syscall overhead, which helps performance, especially on some operating systems.
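A quick, unscientific way to see the compression part with ordinary tools (this is just gzip over plain files, not git's actual pack format):

    # Compress a set of small, similar files individually vs. as one stream
    cat *.txt | wc -c                              # total uncompressed size
    for f in *.txt; do gzip -c "$f"; done | wc -c  # each file compressed on its own
    cat *.txt | gzip -c | wc -c                    # compressed together: usually smaller,
                                                   # because shared patterns get reused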

Upvotes: 1
