user12068519

Reputation: 29

git working directory vs staging area vs local repo vs .git folder

What does the git local repository include? Does it include both the codebase and the history?

I read that the .git folder is the git repository. But it simply contains the history of changes and not the codebase. Is the repository just the history of the changes, while the local repository includes both the history and the codebase?

Is the working directory the codebase?

Upvotes: 2

Views: 2869

Answers (3)

Serge

Reputation: 12344

There are 2 concepts which you need to understand:

  1. git directory which contains git metadata, commit history, branch info, ...

  2. work tree which contains files checked out from the git directory (your working directory).

Git repository usually means both: the git directory and the work tree. However, people sometimes refer to the git directory alone as a git repository.

Many git commands only need to know about the git directory. Others require both the git directory and the work tree. There are several ways to tell those commands where the git directory and the work tree are.

Usually both of them are combined in a single directory structure:

 topdir <-- work tree
 |- .git <-- git directory
 |- checked out files and directories

This way both are discovered automatically (and referenced as a git repository).
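As a rough sketch of both cases (automatic discovery, then explicit locations), the following builds a throwaway repository. The names topdir, sub and notes.txt are made up for the illustration, and it assumes a Git recent enough to have rev-parse --absolute-git-dir. The GIT_DIR and GIT_WORK_TREE environment variables are another way to pass the same two locations.

```shell
set -e
tmp=$(mktemp -d)
git init -q "$tmp/topdir"            # "topdir" is a hypothetical name
mkdir "$tmp/topdir/sub"
touch "$tmp/topdir/notes.txt"

# Automatic discovery: Git walks up from the current directory to find .git.
cd "$tmp/topdir/sub"
git_dir=$(git rev-parse --absolute-git-dir)
work_tree=$(git rev-parse --show-toplevel)
echo "git directory: $git_dir"
echo "work tree:     $work_tree"

# Explicit locations: run from outside the work tree and name both parts.
cd "$tmp"
listing=$(git --git-dir="$tmp/topdir/.git" --work-tree="$tmp/topdir" status --short)
echo "$listing"                      # the untracked notes.txt shows up here
```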

Upvotes: 0

torek

Reputation: 488003

A repository consists of several parts, which you can group in different ways. I'll start with this grouping:

  • The main bulk of a repository, which you get even with git clone --bare, is a sort of database, or really, a pair of databases, plus a bunch of auxiliary files needed to use them. This is the stuff that is in the .git directory in a normal (non-bare) clone.

    The things in this database are in a form suitable for Git to use, but not a form suitable for you, or everything else that you do on your computer, to use. Hence:

  • The other part of the repository is your work-tree. The work tree, or working tree, or some variant on this name, is where you do your work. A bare clone omits the work-tree, so that you cannot do any work in it.

  • In between the repository proper and your work-tree lies Git's index, which Git also calls the staging area (or, rarely these days, the cache). The current actual implementation of the index is a file in .git/index plus, sometimes, one or more additional files to make things go a bit faster, although in general you shouldn't concern yourself too much with the inner workings of the index.
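To make the bare versus non-bare distinction concrete, here is a small sketch that clones one scratch repository both ways. The directory names src, normal and bare.git, and the throwaway identity, are assumptions for the demo only.

```shell
set -e
# Throwaway identity so the commit works on an unconfigured machine.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"

# A small source repository with one commit.
git init -q src
echo "hello" > src/file.txt
git -C src add file.txt
git -C src commit -q -m "first"

git clone -q src normal            # work-tree plus normal/.git
git clone -q --bare src bare.git   # only the database part, no work-tree

normal_is_bare=$(git -C normal rev-parse --is-bare-repository)
bare_is_bare=$(git -C bare.git rev-parse --is-bare-repository)
echo "normal: bare=$normal_is_bare, bare.git: bare=$bare_is_bare"
ls normal      # the checked-out file.txt
ls bare.git    # HEAD, config, objects, refs and friends, but no file.txt
```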

The index does not fit terribly well into this picture, and there's a good reason for that: it's really intended to group together with the work-tree, not with the main Git repository. Cloning a repository does not clone the index, and since Git 2.5, Git has offered a command, git worktree, that allows you to add more work-trees. When you do add a work-tree, you actually get a whole set of extra files: <HEAD and other special references such as those for git bisect; index; work-tree>. But since HEAD and these various references also don't get copied by git clone, and do all live somewhere under the .git directory, you always have to deal with this slightly muddled, mixed-up image.

From a good distance, then, there's a clean separation: .git holds the stuff that gets cloned (and that Git deals with), and your work-tree holds the stuff you work on (that doesn't get cloned). A bare repository has only the stuff that gets cloned. But in fact there's stuff in .git that doesn't get cloned too, including the index/staging-area. A bare repository still has a HEAD and an index, even though they don't get cloned. Last, adding work-trees with git worktree add not only creates the new work-tree, it also creates a bunch of files inside .git that also don't get cloned, and that are meant for the added work-tree only.
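The per-work-tree files are easy to see in a scratch repository. In this sketch the names main-tree, side-tree and side are invented for the example; the only assumption about Git itself is version 2.5 or later, for git worktree.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q main-tree
cd main-tree
echo "hello" > file.txt
git add file.txt
git commit -q -m "first"

# Add a second work-tree, on a new branch named "side".
git worktree add -b side ../side-tree >/dev/null 2>&1

# The added work-tree gets its own HEAD and index inside the main .git.
ls .git/worktrees/side-tree
# Inside the added work-tree, .git is a small file pointing back, not a directory.
cat ../side-tree/.git
git worktree list
```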

Is the repository just the history of the changes ...

In some sense this doesn't matter, but Git is very up-front about its storage system, and this needs a bit of adjustment: Git doesn't store changes at all! Instead, Git stores snapshots.

I mentioned in my first bullet-point that what's in .git is primarily a pair of databases. These two databases are both simple key-value stores. One database, usually rather smaller, stores names and hash IDs. The names are a generalized form of branch, tag, and other names. For instance the name master, which is almost certainly a branch name, is really refs/heads/master, which is definitely a branch name. The name v2.5.0—the Git version that introduces git worktree—is a tag name and is really refs/tags/v2.5.0. Running git rev-parse allows you to turn an arbitrary name, including a branch or tag name, into a hash ID, if there is such a name in this database:

$ git rev-parse v2.5.0
8d1720157c660d9e0f96d2c5178db3bc8c950436

This hash ID is the key to the bigger, and in some sense main, database. That database maps hash IDs to Git objects. A Git object is how Git stores data and metadata, including commits and the files that act as the snapshots in that commit.
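The smaller name database can be listed directly. As a sketch in a scratch repository (the tag name v1 is made up, and the branch name is read from HEAD rather than assumed to be master), short and fully spelled-out names resolve to the same hash ID:

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
echo "hello" > file.txt
git add file.txt
git commit -q -m "first"
git tag v1                                   # lightweight tag, hypothetical name

# Every entry in the name database: full name, then the hash ID it maps to.
git for-each-ref --format='%(refname) %(objectname)'

branch=$(git symbolic-ref --short HEAD)      # whatever the default branch is called
short_id=$(git rev-parse "$branch")
full_id=$(git rev-parse "refs/heads/$branch")
tag_short=$(git rev-parse v1)
tag_full=$(git rev-parse refs/tags/v1)
echo "$branch -> $short_id"
```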

Given any hash ID, you can have a low-level Git command get you the object's type:

$ git cat-file -t 8d1720157c660d9e0f96d2c5178db3bc8c950436
tag

or contents:

$ git cat-file -p 8d1720157c660d9e0f96d2c5178db3bc8c950436 | sed 's/@/ /'
object a17c56c056d5fea0843b429132904c429a900229
type commit
tag v2.5.0
tagger Junio C Hamano <gitster pobox.com> 1438025401 -0700

Git 2.5
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQIcBAABAgAGBQJVtoa5AAoJELC16IaWr+bLRtQP/0RYjVe9fLubiN5vLaAJ98B5
K3apw8bScJ4bZQJiOGMZg7AJ8pSB9XchqopjNlO2v8XVrZEkFPQ7ln3ELjOITusO
[snip rest of PGP signature]

In this case, the tag object holds the hash ID of the commit object. That's the first line of the above. So next, we can have Git fish out the commit object, and print that:

$ git cat-file -p a17c56c056d5fea0843b429132904c429a900229 | sed 's/@/ /'
tree deec48fbc77f5951f81d7b5559360cdefe88ce7e
parent 7a2c87b1524e7e0fbb6c9eef03610b4f5b87236a
author Junio C Hamano <gitster pobox.com> 1438025387 -0700
committer Junio C Hamano <gitster pobox.com> 1438025387 -0700

Git 2.5

Signed-off-by: Junio C Hamano <gitster pobox.com>

The above are, in fact, the full contents of the commit that is Git 2.5 (with @ changed to space to perhaps, maybe, cut down on spam-load). The tree line is how the commit saves a full snapshot of every file, as that gives yet another hash ID of yet another internal object:

$ git cat-file -t deec48fbc77f5951f81d7b5559360cdefe88ce7e
tree

If we look inside the tree we'll find, e.g., that it has an entry reading:

100644 blob 5ca601ee14fd2ab3b78577aa22a5db778bc7fbe0    base85.c

which gives us the hash ID of the complete file base85.c that is part of that commit.
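One way to look inside a tree is git ls-tree. The sketch below uses a scratch repository with two invented files, a.txt and b.txt, and changes only one of them in the second commit. Both commits still list both files, which is the snapshot behaviour; the unchanged file simply keeps the same blob hash ID.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
echo "alpha" > a.txt
echo "beta" > b.txt
git add a.txt b.txt
git commit -q -m "first"
echo "alpha, edited" > a.txt
git commit -q -am "second"        # only a.txt changed

git ls-tree HEAD                  # mode, type, hash ID, name for every file

old_a=$(git ls-tree "HEAD^" a.txt)
new_a=$(git ls-tree HEAD a.txt)
old_b=$(git ls-tree "HEAD^" b.txt)
new_b=$(git ls-tree HEAD b.txt)
entries=$(git ls-tree --name-only HEAD | wc -l | tr -d ' ')
```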

That file is still the same in the current version of Git, and we can see that using git rev-parse:

$ git rev-parse master:base85.c
5ca601ee14fd2ab3b78577aa22a5db778bc7fbe0

which is a shortcut way to do what we just did above:

$ git rev-parse v2.5.0:base85.c
5ca601ee14fd2ab3b78577aa22a5db778bc7fbe0

Git looked up v2.5.0 (as refs/tags/v2.5.0) in the first database and found that it was a tag hash ID. So git rev-parse found the actual commit, and the tree, and the line for base85.c, and extracted the hash ID.

Using that hash ID, we can extract the full contents of base85.c directly, with git cat-file -p. The file begins this way:

$ git cat-file -p 5ca601ee14fd2ab3b78577aa22a5db778bc7fbe0
#include "cache.h"

#undef DEBUG_85

#ifdef DEBUG_85
#define say(a) fprintf(stderr, a)
#define say1(a,b) fprintf(stderr, a, b)
#define say2(a,b,c) fprintf(stderr, a, b, c)
#else
#define say(a) do { /* nothing */ } while (0)

There's a direct line from hash ID to contents, and a somewhat less direct line from names—whether they're branch or tag names, or composites like v2.5.0:base85.c—to contents, that involves following the tag to the commit to the tree to the specific entry to get the hash ID.
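The less direct route can be walked by hand and compared with the shortcut. This is only a sketch in a scratch repository: the annotated tag v1 and the file name file.txt are made up, and the parsing with sed and awk assumes the object layouts shown in the transcripts above.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
echo "hello" > file.txt
git add file.txt
git commit -q -m "first"
git tag -a -m "release one" v1           # annotated tag, so there is a tag object

tag_id=$(git rev-parse v1)                                       # name -> tag object
commit_id=$(git cat-file -p "$tag_id" | sed -n '1s/^object //p') # tag -> commit
tree_id=$(git cat-file -p "$commit_id" | sed -n '1s/^tree //p')  # commit -> tree
blob_by_hand=$(git ls-tree "$tree_id" file.txt | awk '{print $3}')  # tree -> entry

blob_shortcut=$(git rev-parse v1:file.txt)   # the same walk in one step
git cat-file -p "$blob_shortcut"             # the file's contents
```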

Getting from snapshots to changes

Almost everything Git does starts with this kind of database lookup. Should you wish to compare two commits, though, you can have Git extract both of them, and just tell you what's different. Commit 745f6812895b31c02b29bdfe4ae8e5498f776c26, for instance, has commit d4b12b9e07eba2e4ec1eff38a4151c9302bd1e2c as its parent, so we can run:

git diff d4b12b9e07eba2e4ec1eff38a4151c9302bd1e2c 745f6812895b31c02b29bdfe4ae8e5498f776c26

to have Git extract both commits, compare them, and show us what changed:

$ git diff d4b12b9e07eba2e4ec1eff38a4151c9302bd1e2c 745f6812895b31c02b29bdfe4ae8e5498f776c26
diff --git a/Documentation/RelNotes/2.24.0.txt b/Documentation/RelNotes/2.24.0.txt
new file mode 100644
index 0000000000..a95a8b0084
--- /dev/null
+++ b/Documentation/RelNotes/2.24.0.txt
[actual diff snipped]

and so on.

Note that when we looked at the 2.5.0 commit, we saw:

tree deec48fbc77f5951f81d7b5559360cdefe88ce7e
parent 7a2c87b1524e7e0fbb6c9eef03610b4f5b87236a

That parent line gives Git the hash ID of the commit that comes before the 2.5.0 commit. So Git can automatically compare a commit to its parent. If we know the hash ID of one commit, we can make Git fish out the hash ID of its parent—and in fact, instead of running git diff, we can run git show, which does this all for us. So that's what we tend to do.

A simple:

git show master

really consists of:

  • parse the name master to get a hash ID
  • use that to find the commit
  • show the commit's author, time-stamps, log message, etc
  • use the commit to find the commit's parent
  • use the two commit hash IDs to extract the two trees
  • compare all the files in the two snapshots
  • for each file that's different, show what's different

All of this takes place through the stuff in the .git directory. What's in the index, and in your work-tree, is not important and not required here, so all of this can be done with a bare repository.
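The steps listed above can be approximated by hand. The sketch below does so inside a bare clone of a scratch repository, to show that neither an index nor a work-tree is involved. The names src and bare.git are invented, and this is an illustration of the idea rather than what git show literally runs internally.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q src
cd src
echo "hello" > file.txt
git add file.txt
git commit -q -m "first"
echo "world" >> file.txt
git commit -q -am "second"
cd "$tmp"
git clone -q --bare src bare.git
cd bare.git

name=$(git symbolic-ref --short HEAD)        # stands in for "master"
commit=$(git rev-parse "$name")              # parse the name to get a hash ID
git cat-file -p "$commit"                    # the commit: author, time-stamps, message
parent=$(git rev-parse "$commit^")           # the commit's parent
new_tree=$(git rev-parse "$commit^{tree}")   # the two snapshots
old_tree=$(git rev-parse "$parent^{tree}")
manual=$(git diff "$old_tree" "$new_tree")   # compare them, file by file
printf '%s\n' "$manual"
```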

Summary

Should you want to actually do any work with a Git repository, you need a non-bare repository, so that you have a work-tree. Git will extract stuff from the object database, as found by big ugly hash IDs, into your work-tree so that you can see it and work on it. Git will let you use names, provided those names are in the name-to-hash-ID database, in place of hash IDs. Git needs the hash IDs; but you probably need the names just to find the hash IDs.

The index or staging area sits between the work-tree and the repository. Its main function is to hold copies of files extracted from the repository—from the object database—so that they're ready to go into new commits. As such, you can think of it as the place you assemble your new commits.

So:

  • Your work-tree holds files in your computer's ordinary format, rather than in the special Git-only format that the index / staging-area holds and that goes into each new commit you make.

  • The index / staging-area holds the proposed next snapshot. This starts out the same as the current snapshot: the commit you checked out so as to get it into your work-tree. If you change a file in your work-tree, you need to copy it back into the index so that the updated file is the one that goes into the next commit.

  • Each commit contains a full snapshot of every file, as of whatever form it had in the index at the time you ran git commit.

  • History, in a Git repository, is nothing more than the commits themselves. Each commit remembers its immediate predecessor—the raw hash ID of that earlier commit—and every commit is found by its hash ID. Names like master are mostly for mere humans, who for some reason can't seem to remember random-looking hash IDs.
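A short sketch of the three places a file can live, using a scratch repository and an invented file.txt: git diff compares the work-tree with the index, git diff --cached compares the index with the current commit, and git commit snapshots whatever is in the index, even if the work-tree has moved on since.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
echo "one" > file.txt
git add file.txt
git commit -q -m "first"

echo "two two" > file.txt                       # change the work-tree only
unstaged=$(git diff --name-only)                # work-tree vs index
staged=$(git diff --cached --name-only)         # index vs current commit

git add file.txt                                # copy the file into the index
unstaged_after=$(git diff --name-only)
staged_after=$(git diff --cached --name-only)

echo "three three three" > file.txt             # edit again, but do not add
git commit -q -m "second"                       # snapshots the index copy
committed=$(git cat-file -p HEAD:file.txt)
in_work_tree=$(cat file.txt)
echo "committed: $committed / work-tree: $in_work_tree"
```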

Branch and tag names have another important role, but for that, you should start with Think Like (a) Git.

Upvotes: 4

sunknudsen

Reputation: 7250

What does the git local repository include? Does it include both the codebase and the history?

The git local repository includes all files of a given revision and the history of changes.

Is the working directory the codebase?

Yes, at a given revision.

Revisions are "versions" of the codebase for a given branch.

For example, when you git clone https://github.com/expressjs/express, you clone the whole repository of Express which includes its history of changes.

git clone https://github.com/expressjs/express.git
Cloning into 'express'...
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 30279 (delta 0), reused 0 (delta 0), pack-reused 30276
Receiving objects: 100% (30279/30279), 8.60 MiB | 10.08 MiB/s, done.
Resolving deltas: 100% (17089/17089), done.

You can then switch the codebase to 4.x using git checkout 4.x without having access to the internet.

git checkout 4.x
Branch '4.x' set up to track remote branch '4.x' from 'origin'.
Switched to a new branch '4.x'
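The same thing can be tried without any network at all. This sketch uses a local stand-in for the Express repository: the directory names upstream and local and the file version.txt are made up, and only the branch name 4.x is borrowed from the example above.

```shell
set -e
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
tmp=$(mktemp -d)
cd "$tmp"
git init -q upstream
cd upstream
echo "v5" > version.txt
git add version.txt
git commit -q -m "mainline"
git checkout -q -b 4.x
echo "v4" > version.txt
git commit -q -am "maintenance line"
git checkout -q -                  # back to the default branch

cd "$tmp"
git clone -q upstream local        # copies all commits, not just the checked-out files
cd local
git branch -r                      # origin/4.x is already stored locally
git checkout -q 4.x                # creates a local 4.x from origin/4.x
current=$(git symbolic-ref --short HEAD)
content=$(cat version.txt)
echo "on $current, version.txt says $content"
```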

Upvotes: 0
