user12184817

Where are the files stored in the git directory

In the book progit, it says the following line:

The Git directory is where Git stores the metadata and object database for your project. This is the most important part of Git, and it is what is copied when you clone a repository from another computer.

There are two things that I don't understand here. I suppose the git directory is the .git folder, so, where are the different snapshots stored there? It seems the folder is way too small to have them there.

The other thing is, it says that when you clone a git repository it copies the .git folder, but doesn't it also copy the file contents in the working tree? or does it take it out of the .git folder?

Upvotes: 0

Views: 1560

Answers (4)

matt

Reputation: 534949

It seems the folder is way too small to have them there.

Well, it isn't. Your way of measuring might be deceiving you. Keep in mind also that text files may be compacted considerably and that multiple copies of the same file may take up no extra space or very little extra space.

The other thing is, it says that when you clone a git repository it copies the .git folder, but doesn't it also copy the file contents in the working tree?

Effectively yes. git clone does a bunch of things; the two most important are just what you said: it copies the repository itself as the .git directory, plus it then (by default) does a git checkout to copy one branch's tip into the working tree.

Upvotes: 0

Schwern

Reputation: 164699

The other thing is, it says that when you clone a git repository it copies the .git folder, but doesn't it also copy the file contents in the working tree? or does it take it out of the .git folder?

To fill in some more detail, a remote repository is typically "bare". That means there is no working tree, only the repository files. You can see this with git init --bare.

$ git init --bare test_bare
$ ls test_bare/
HEAD  config  description  hooks  info  objects  refs

Remotes are "bare" so you can push and pull from them without Git having to worry about also changing the working tree, which would get very confusing for anyone using that working tree.

where are the different snapshots stored there? It seems the folder is way too small to have them there.

To add another key point, when you do a git commit it does store a "snapshot" of the complete state of all files, but any unchanged files will not be stored twice.

The content of each file is stored in a compressed "blob" (Binary Large OBject). The blob is named by a checksum (hash) of its content, so identical content always gets the same name. When you commit and a file is unchanged, Git simply reuses the existing blob.

# file1 and file2 have changed. Git stores new blobs for them.
git add file1 file2
git commit

# Git stores a new blob for file3.
# It references the existing blobs for file1 and file2.
git add file3
git commit

# Since only the filename is changed, Git uses the existing blob.
# The filename is stored in a "tree object", basically a directory listing.
git mv file3 other3
git commit
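To check the blob reuse described above for yourself, here is a minimal sketch. It assumes a POSIX shell with git, awk, and mktemp available; the scratch directory and the names file3, other3, and copy are only for illustration.

```shell
set -e
demo=$(mktemp -d)                 # throwaway scratch directory (illustrative)
cd "$demo"
git init -q .

printf 'same content\n' > file3
git add file3
# `git ls-files -s` prints: <mode> <blob id> <stage> <path>
id_before=$(git ls-files -s file3 | awk '{print $2}')

# Renaming changes only the path recorded in the index, not the blob.
git mv file3 other3
id_after=$(git ls-files -s other3 | awk '{print $2}')

# A different file with identical bytes hashes to the same blob id.
printf 'same content\n' > copy
id_copy=$(git hash-object copy)

echo "before rename: $id_before"
echo "after rename:  $id_after"
echo "copy:          $id_copy"
```

All three IDs should match, because the ID depends only on the content, not on the filename.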

If you're familiar with how a filesystem works, Git is structured very similarly.

Typically only a handful of small, easily compressed files are changed in a given commit. So the size of each commit is small.

Git has additional tricks, but that's the basic idea. It's also why storing large files in Git is a bad idea: Git stores a complete copy of the file every time it changes. Use Git Large File Storage instead. And it's also why files should be stored in Git uncompressed; Git will do its own compression.

See Git Internals in Pro Git for more.

Upvotes: 1

Mark Adelsberger

Reputation: 45659

I suppose the git directory is the .git folder

For a typical clone, yes. There are exceptions (like bare repos) but that isn't too important to your question; when you look at the local repo you get by cloning something, you can expect .git to be what progit calls "the git directory".

So, where are the different snapshots stored there? It seems the folder is way too small to have them there.

The .git/objects directory contains the repo's content. Files are represented as BLOB objects; directories as TREE objects; and there are also COMMIT objects and various other types of object used for various git features. It's not easy to inspect these files by hand, but the data is there and you can use lower-level git commands to navigate it if you want (e.g. git cat-file).
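As a rough illustration of navigating the object database with git cat-file, the sketch below builds a one-commit scratch repo and prints each object type. The directory, the file name, and the demo identity passed with -c are assumptions made only so the example is self-contained.

```shell
set -e
demo=$(mktemp -d)                 # throwaway scratch repo (illustrative)
cd "$demo"
git init -q .
echo 'Hello World' > file
git add file
# Identity is supplied inline so the sketch does not depend on global config.
git -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'

git cat-file -t HEAD              # type of the object HEAD points at
git cat-file -p HEAD              # commit body: tree id, author, message
git cat-file -p 'HEAD^{tree}'     # tree: one entry per file (mode, type, id, name)

blob_type=$(git cat-file -t HEAD:file)   # type of the object behind "file"
blob_body=$(git cat-file -p HEAD:file)   # its decompressed content
echo "$blob_type: $blob_body"
```

Here -t reports an object's type and -p pretty-prints its content, so you can walk from commit to tree to blob by hand.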

An object can be in "loose" storage, in which case it's somewhere in the various directories whose names are two hex digits. Or - as would be expected in a fresh clone - it can be in "packed" storage (under .git/objects/pack). A couple forms of compression - including deltas for older versions of files - are used to control the size of this data on disk as the repo history grows. That is why the directory may not seem like it takes "enough" space to hold everything.
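One way to watch objects move from loose to packed storage is git count-objects -v. This is only a sketch on a scratch repo; the directory, file name, and demo identity are illustrative assumptions.

```shell
set -e
demo=$(mktemp -d)                 # throwaway scratch repo (illustrative)
cd "$demo"
git init -q .
echo 'Hello World' > file
git add file

# "count:" is the number of loose objects; "packs:" the number of pack files.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'
git repack -a -d -q               # pack everything, drop the now-redundant loose copies

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packs=$(git count-objects -v | awk '/^packs:/ {print $2}')
echo "loose before: $loose_before, loose after: $loose_after, packs: $packs"
```

After the repack the loose count should drop to zero and the objects live in a single pack under .git/objects/pack.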

(As an aside, certain types of file do not "play nice" with the compression methods git uses; this is one reason why large binary files should be managed with a tool like LFS.)

it says that when you clone a git repository it copies the .git folder, but doesn't it also copy the file contents in the working tree? or does it take it out of the .git folder?

clone only copies the git directory. Unless given options to the contrary, it then does a checkout of the default branch (usually master), which creates a copy of the working tree. It depends on how your remote is hosted, but the odds are you wouldn't actually find the working tree on the remote, so clone couldn't copy it directly from the remote even if it wanted to. It has to extract it from the database.
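The two steps can be separated to see that the working tree really is extracted from the database. A minimal sketch, assuming local scratch repos named src and dst (both hypothetical) and using clone's --no-checkout option to skip the second step:

```shell
set -e
work=$(mktemp -d)                 # throwaway scratch area (illustrative)
cd "$work"
git init -q src
echo 'Hello World' > src/file
git -C src add file
git -C src -c user.name=demo -c user.email=demo@example.com commit -q -m 'first'

# Step 1 only: copy the git directory, skip the checkout.
git clone -q --no-checkout src dst
if [ -e dst/file ]; then before=present; else before=absent; fi

# Step 2 by hand: populate the index and working tree from the object database.
git -C dst checkout -q HEAD -- .
after=$(cat dst/file)

echo "before checkout: $before"
echo "after checkout:  $after"
```

The file only appears in dst after the explicit checkout, even though its content was already inside dst/.git.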

One corollary of all this is, only committed data can be shared by clone, fetch, or push. That is to say, suppose you create a local repo

mkdir repo1
cd repo1
git init
touch file1
git add .
git commit -m1
echo hi > file1
touch file2

Now there are reasons why you typically don't use a repo that has worktrees as a remote... but you could.

cd ..
git clone repo1 repo2

Now if you look at repo2, you'll see that it only has an empty file1; nothing that wasn't committed in repo1 is visible - unlike if the working directory had been copied by clone.

cd repo1

Upvotes: 0

jthill

Reputation: 60255

where are the different snapshots stored there? It seems the folder is way too small to have them there.

It might seem too small, but they're there. Git uses deltas purely for storage compression, and the compression engine doesn't care where duplicated content is: if you've got, say, a license hunk or other boilerplate in all your files, it's very likely stored (in the packs) just once.

Git's first-cut storage is a straight zlib-compressed snapshot with a little type-and-length prefix.
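To make the type-and-length prefix concrete, here is a sketch that rebuilds a blob ID by hand. It assumes the repository uses the default SHA-1 object format and that sha1sum is installed (on a Mac, shasum would be the likely substitute); the scratch directory is illustrative.

```shell
set -e
demo=$(mktemp -d)                 # throwaway scratch repo (illustrative)
cd "$demo"
git init -q .

content='Hello World'
# Length in bytes of the content including its trailing newline.
len=$(printf '%s\n' "$content" | wc -c | tr -d ' ')

# Hash "blob <length>" + NUL byte + content, which is the layout Git hashes.
manual=$(printf 'blob %s\0%s\n' "$len" "$content" | sha1sum | cut -d' ' -f1)

# Ask Git for the ID of the same content.
git_id=$(printf '%s\n' "$content" | git hash-object --stdin)

echo "manual: $manual"
echo "git:    $git_id"
```

The two IDs should agree; what Git writes to disk is that same prefix-plus-content stream, zlib-compressed.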

Try this:

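git init --template='' test; cd $_
find . -type f              # only Git's own bookkeeping files so far
echo Hello World > file
git add file
find . -type f              # a loose object has appeared under .git/objects
git ls-files -s             # the index entry: mode, blob id, stage, path
git repack -ad              # pack everything and remove the loose copies
find . -type f              # now there's a .pack and .idx under .git/objects/pack
git verify-pack -v .git/objects/pack/*.idx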

and that'll get you started on what Git's doing with your content. Ordinarily it waits for a nice fat batch of snapshots to pack up and delta-compress; the more you can batch up before packing, the bigger the storage win.

Upvotes: 0
