SteeveDroz

Reputation: 6136

What is a blob under the hood?

I read on the official git website that:

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These systems (CVS, Subversion, Perforce, Bazaar, and so on) think of the information they keep as a set of files and the changes made to each file over time, (...)

Git doesn’t think of or store its data this way. Instead, Git thinks of its data more like a set of snapshots of a mini filesystem. Every time you commit, or save the state of your project in Git, it basically takes a picture of what all your files look like at that moment and stores a reference to that snapshot. (...)

So I was wondering: if snapshots and not changes are saved, does it mean that if I change just one character in a 10 KB file, a second 10 KB file (or blob) will be created in my repository?

What is a blob under the hood? The file itself? Is it compressed? Is any small change in my file growing the repository drastically?

As I know you guys, I'll answer your comments before they come: I understand that disk space is not a problem anymore and that I don't have to worry about copying 10 KB. My question is just to satisfy my curiosity.

EDIT

Ok, "Git's blob data and diff information" gives half of the information. But is it compressed and/or space-optimized in any way?

Upvotes: 1

Views: 380

Answers (2)

jthill

Reputation: 60295

So I was wondering: if snapshots and not changes are saved, does it mean that if I change just one character in a 10 KB file, a second 10 KB file (or blob) will be created in my repository?

Short answer: yes. Details: the config options core.compression and core.looseCompression set the zlib compression level, the first as the general default and the second for loose objects specifically (pack.compression does the same for packed objects). Loose objects by default use bare-minimum compression settings. Every loose object is stored as type[sp]length[nul]data, compressed by the exact equivalent of the zpipe example that comes with zlib itself. As usual, git is very straightforward. Packing is handled entirely inside git's object access layer, so nothing above that layer needs to know whether an object is loose or packed.
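To make the type[sp]length[nul]data layout concrete, here is a minimal Python sketch, not git's own code. The helper names (blob_object, object_id, loose_bytes) are made up for illustration, it assumes a classic SHA-1 repository, and it assumes zlib level 1 for loose objects, so the compressed bytes will not necessarily match git's byte for byte.

```python
import hashlib
import zlib


def blob_object(data: bytes) -> bytes:
    # Uncompressed object: "blob", a space, the decimal length, a NUL, then the data.
    return b"blob " + str(len(data)).encode("ascii") + b"\0" + data


def object_id(obj: bytes) -> str:
    # The object name is the SHA-1 of header plus data (SHA-1 repositories only).
    return hashlib.sha1(obj).hexdigest()


def loose_bytes(obj: bytes, level: int = 1) -> bytes:
    # Assumption: level 1 stands in for git's low default for loose objects.
    return zlib.compress(obj, level)


original = b"x = 1\n" * 1000
edited = b"y" + original[1:]  # change a single character

for name, data in (("original", original), ("edited", edited)):
    obj = blob_object(data)
    print(name, object_id(obj), len(data), "->", len(loose_bytes(obj)), "bytes")
```

The two versions get different ids, so each is stored as its own complete (zlib-compressed) object until a repack replaces them with deltas.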

Fetching zlib, building zpipe, and running it on loose objects can be very useful. It's one thing to hear that objects are just the data with e.g. "blob 123\0" stuck on the front for a 123-byte blob or "commit 1323\0" stuck on the front of the text of a commit, and another to see that it really is that simple. Even the pack format isn't much; it just happens by pure random chance to work really, really well.
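If building zpipe is too much trouble, roughly the same inspection can be sketched with Python's zlib module. parse_loose is a hypothetical helper name, and the sample object is built in memory, not read from a real repository.

```python
import zlib


def parse_loose(compressed: bytes):
    # Inflate, then split "type length\0data" at the first NUL byte.
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("length in header does not match data")
    return kind.decode("ascii"), int(size), body


# Stand-in for the bytes of a loose object file holding a 123-byte blob.
sample = zlib.compress(b"blob 123\0" + b"a" * 123)
kind, size, body = parse_loose(sample)
print(kind, size)  # blob 123
```

On a real repository the input would be the raw bytes of a file under .git/objects/, read in binary mode. Packed objects use a different layout and will not parse this way.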

Git packs up and compresses loose objects whenever its heuristics say there are enough of them lying around to make the delta compression payoff gratifying. You can tune those, too, but all of them are more or less on target as-is, and looking back on the times I've bothered manually repacking I can't really say it was worth bothering about.

Upvotes: 1

bperson

Reputation: 1335

(Quick and noobish answer)

It gets compressed when packing your repo. From what I know, git will sometimes reverse the diff so that the version stored in full is the latest one, and the diffs lead back to the older ones. This makes accessing the latest changes quicker.

Upvotes: 2
