Reputation: 141
I'm a git newbie and I'm a bit confused between a new working tree and a branch. After reading the docs I'm still confused so I decide to ask.
I understand that a Git branch is like a new 'branch' in the current working tree. If I want to develop a new feature I can just checkout to my newly created branch from the Main branch, develop the feature then merge/rebase it into the Main branch.
I want to confirm my understanding: The concept of a new working tree in the same repository is that we create a completely new working space/tree (with a separate Main branch). And in that tree, I can have many new branches?
I would really appreciate it if you can point to an example (on Github) that the repo consists of more than one working tree so I can study further.
Thank you!
Upvotes: 9
Views: 3799
Reputation: 488063
There is a difference here, and it's significant, but understanding it is tricky. To get there, let's start with this: Git, in the end, is all about commits. It's not about branches, but branch names help us (and Git) find commits. It's not about files either, although each commit contains files. It's really about the commits. So we need to start with what a commit is and what it does for you. Then we'll go on to branch names and how they find commits, and Git's index and your working tree. Once we have all of those covered properly, we can introduce git worktree
and its peculiarities.
A commit, in Git, stores two things:
It has a full snapshot of every file, frozen for all time, as of the form that file had at the time you told Git to make the snapshot. (We'll see where these files actually come from later: the surprise is that it's not the files you see and work with / on.) These files are stored in a special, read-only, Git-only, compressed and de-duplicated format, that only Git itself can use. The de-duplication takes care of the fact that each new commit usually has mostly the same files as some older, existing commit. Those files literally get shared between commits—which is completely safe, since no part of any commit can ever be altered.
Besides the snapshot, each commit has some metadata. The metadata hold information such as who made the commit and when. You get to put a log message in here as well, explaining why you made the commit, so that someone else—or your future self—can come back and see what you meant to do. (It's a good idea to put in a high level explanation of what was wrong with the previous commit and what this one is doing to improve it, in case you later find a bug. There's no point putting in low-level details about individual changed lines, because git diff
or git show
can show that mechanically.) Like the snapshot, this metadata is also frozen for all time—in fact, that's a feature of every internal Git object: none can ever be changed, not even by Git itself.
Now, to find a commit directly, Git needs to know the commit's hash ID. Commit hash IDs are big ugly strings representing a large hexadecimal number, which is actually a cryptographic checksum of the contents of the commit. (This is why the contents can't change: changing even a single bit changes the checksum. Git makes sure, during extraction, that the contents still checksum to the key used to find the object. The objects themselves are stored in a key-value database, with the keys being the checksums.)
So, we need some hash ID H
to look up a commit. The git log
command shows you the hash IDs (full or abbreviated, depending on what log format you choose). But these things are useless to humans, who often can't tell 211eca0895794362184da2be2a2d812d070719d3
from 21127fa9829da1f7b805e44517970194490567d0
for instance. Git can; computers are good at this sort of finicky detail. So each commit, in its metadata, stores the raw hash ID of one or more earlier commits.
Most commits store exactly one earlier-commit hash ID. This makes for a backwards-looking chain of commits. That is, if we know—somehow—that the hash ID of the latest commit on our master
or main
branch is H
(H
here is standing in for some real, albeit ugly, hash ID), we could give that raw hash ID to git log
. Git would then look up commit H
, and commit H
has, inside its metadata, the hash ID of some earlier commit G
:
G <-H
But G
is a commit too, so it has metadata. Git can fish the entire commit G
out of the big database of all commits using the hash ID it just got from H
, and that metadata stores the hash ID of some still-earlier commit:
... <-F <-G <-H
The git log
command can keep this process up forever, or rather, as long as each commit it finds has some previous commit. The chain finally ends at the beginning of history: at the very first commit, which—being the first commit—does not have a previous commit hash ID in it, because it can't.
Hence, all we need to get started, at least, is to somehow magically know the hash ID of the latest commit H
. From there, Git can work backwards, one commit at a time. These commits literally are the history, stored in the repository. Each commit has some metadata telling who made it, when, and why; and each commit has the hash ID of the next-earlier commit. Each commit is a full snapshot of the entire set of files as of the time of that commit. And, by comparing any two adjacent commits' content, Git can tell us what changed in that commit. But we do have that one hitch: we need to know the hash ID of the latest commit.
This is where branch names come in. A branch name like master
or main
simply holds one commit hash ID. The hash ID stored in that name is that of some commit that definitely does exist in the big Git object database, and—by definition—that hash ID tells Git that that commit is the last commit in/on that branch:
...--F--G--H <-- main
Now, just because H
is the last commit on main
does not mean that there aren't any commits that might come "after" H
. For instance, let's now create a new branch name, develop
, and make it also point to commit H
for the moment:
...--F--G--H <-- develop, main
No matter which name we give to git checkout
or git switch
, Git will extract commit H
. Both names select commit H
, after all. In fact, both branches contain the same set of commits. But Git does need to know which name we're using, so let's add a special name, HEAD
, to these drawings:
...--F--G--H <-- develop, main (HEAD)
Here, HEAD
is "attached to" main
. This means we're on branch main
, using commit H
. If we run git checkout develop
or git switch develop
, Git will switch from commit H
to commit H
—that's not much of a switch!—but will now attach HEAD
to the name develop
:
...--F--G--H <-- develop (HEAD), main
From here, let's make one new commit. Git will give this new commit a new, universally-unique hash ID.1 We'll just call it I
, and draw it in like this:
...--F--G--H <-- main
\
I <-- develop (HEAD)
Commit I
will point backwards to its previous commit H
, from which we grew I
, and—this is the sneaky trick—now Git writes I
's hash ID into the name develop
. This advances the branch. Now develop
contains commit I
, as its final commit, along with commits up through H
; main
still ends at H
.
This is how branches grow by making new commits: each time we make a new commit, the branch name advances as well. If we make two commits on develop
, we get:
...--F--G--H <-- main
\
I--J <-- develop (HEAD)
If we flip back to main
, and either create a new branch name, or just make commits directly on main
, we find that the two branches will diverge:
K--L <-- main (HEAD)
/
...--F--G--H
\
I--J <-- develop
If we decide that having made those commits directly on main
is a bad idea, we can now create a new branch to point to L
:
K--L <-- main (HEAD), feature
/
...--F--G--H
\
I--J <-- develop
and then force Git to move the name main
back to H
(we'll have to find H
's hash ID, perhaps with git log
; we might then use git reset --hard
to move main
):
K--L <-- feature
/
...--F--G--H <-- main (HEAD)
\
I--J <-- develop
What this emphasizes is that branch names move. Normally, they move by you adding new commits to the branch, using git commit
. But you can move them any which way, using git reset
or git branch -f
for instance. Git simply uses the branch names to find the commits.
(If you move some branch names backwards, some commits can become very hard to find. For instance, suppose we force develop
back to commit I
. How will we find commit J
after that? There are some ways, but it gets messy, so whenever you're forcing a branch name around, you should think carefully first.)
1"Universally unique" here means that this hash ID not only doesn't occur in this Git repository yet, it also isn't in any other Git repository, now, in the past, or in the future! Git doesn't actually have to meet this strong a constraint—the new hash ID need only not-exist in any Git repository that this Git repository will ever "meet", except when the two meeting Gits are sharing this commit—but Git tries to achieve it anyway, since we don't know for sure which Git repositories we'll connect together in the future. This is why hash IDs are so big and ugly. (As it turns out, they're not quite big and ugly enough now. They were in 2005 when Git was first released, but time marches on.)
I noted several times that the files in commits are all read-only and Git-only and literally unusable. For this reason, when you check out a commit, Git has to copy these files out, into a form that your computer can use normally. These copies go into your working tree. This is pretty straightforward and easy to understand, really: the commit holds an archive, that has to be de-archived to be used. So git checkout
or git switch
does that: it removes the previous checkout, if any, and replaces it with the commit you choose, which is the one some branch name selects.
(In practice, it's actually a lot fancier and more complicated than this. First, to go fast, the checking out process tries not to touch any file it doesn't have to; second, to avoid discarding uncommitted work, it has to check to make sure that none of the files it will remove-and-replace have uncommitted work in them. But "remove the old commit, put in the new one" is fine as a starting model in your head.)
I also mentioned earlier, though, that git commit
does not actually make the new snapshots from what you have in your working tree. Instead, Git inserts an extra "copy"—in quotes here as I'll explain in a moment—of each file, between the current committed version and your working tree copy. This extra "copy" is in what Git calls, variously, the index, or the staging area, or—rarely these days—the cache. All three names refer to the same thing, which I'm calling "the index" here.
What is in the index—at least initially, right after a fresh check-out—is a pre-compressed-and-de-duplicated "copy" of each file from the current commit. Since these are all in the current commit, the index doesn't really contain any actual file copies at all, just references to ready-to-re-commit files.
As you do your work, Git expects you to run git add
on each file you change in your working tree. When you do run git add
, Git:
So any new file is now in the index, and any updated file has its index copy swapped out for the new, ready-to-commit, compressed and de-duplicated data.
When you run git commit
, Git simply packages up the index to use as the new commit snapshot. This means that the index's role is to act as your proposed next snapshot. It starts out matching the current commit, but changes as you run git add
.
This is why one of the names for the index is the staging area: as you update files, you arrange the updated one "on stage", ready to be snapshotted. The copies in your working tree are yours to fuss with, except:
git checkout
or git switch
to overwrite them due to switching to some other commit;git add
to tell Git: copy from working tree back into your index, to make a new copy ready for the next commit.The "copies" (pre-de-duplicated) in Git's index are for Git to use in the next commit.
(The index takes on an expanded role during git merge
operations, though we won't cover that here. This expanded role is why "index" is, maybe, a better name than "staging area". But "staging area" is a fine name for how it works outside this case.)
git checkout
or git switch
command picks a new branch name and/or commit to check out. (You can pick a raw commit, using detached HEAD mode, but we won't cover that here.) Assuming you use a new branch name, Git will now attach HEAD
to that branch name, so that becomes the current branch name.git add
them or use some other Git command to replace the working tree content, everything here is yours to play with.git worktree add
In the above picture, there is:
HEAD
, which names which branch is the current checked-out branch;This means that if we are, for instance, in the middle of working on some new feature, and some highly important must-be-fixed-immediately bug comes up, we have to:
If that's a problem, we could just clone the source repository again. This gets an entirely separate repository, in which we have all the same commits—the hash IDs match up because they are literally the same commits—but we have a new working tree, index, and HEAD
. But it might be nice if we could avoid having to run git clone
and throw a lot of disk space and/or networking time and/or whatever other resources it takes at this.
What if we could, without disturbing our existing working tree, just add a new working tree? We'd also need a new index and a new HEAD
so that this extra working tree could have a different branch checked out in it.
This is precisely what git worktree add
does. We tell Git: make me a new working tree, and check out some branch in it. Git makes all three things—the HEAD
, the index, and the working tree—and does a git switch
or git checkout
there to fill that working tree from that commit as selected by that other branch name.
There's an odd-seeming (at first) constraint, though: our new working tree must be on a different branch than our main working tree or any other added working tree. The reason why becomes clearer once you think about how git commit
automatically updates whichever branch name is checked out. As an exercise, think about what would happen if two working trees both had branch develop
checked out, and then you made a new commit in one of them. What happens to the files, and Git's index, in the other working tree? (I'll answer that one for you: nothing happens to them.) So then, what happens when you go into that other working tree and try to make another new commit?
Instead of trying to make this all work (one can imagine various ways to make it work), Git simply forbids it entirely, so that the problem can't come up in the first place. So that's why this odd restriction exists. This restriction does not apply to detached-HEAD mode, so added working trees can be in detached HEAD mode on any specific commit, but again, we haven't really covered detached HEAD mode here.
I understand that a Git branch is like a new 'branch' in the current working tree.
No: a working tree is just a place for you to do your work. The term branch is ambiguous (see What exactly do we mean by "branch"?), in Git, but a branch name, like main
or develop
, is just a name by which we (and Git) find one specific hash ID. We can make that name point to any commit we like. Each repository has its own names: if I clone your GitHub repository, I have my own branch names. When you clone your GitHub repository, you get your own branch names in your clone. The GitHub repository has its own branch names. None of these names has to match: I can work with your main
while calling it niam
locally, if I like (though that would be silly and just make extra work for me).
The concept of a new working tree in the same repository is that we create a completely new working space/tree (with a separate Main branch).
No: we create a new working tree within our repository clone, but share everything else.2 In particular, all the branch names are shared. This is all within one clone (one repository).
I would really appreciate it if you can point to an example (on Github) that the repo consists of more than one working tree so I can study further.
This is literally impossible: first, added work-trees are specific to each clone. Moreover, GitHub repositories are so-called bare repositories, that have no working trees at all (so that no one can do any work directly on GitHub, not that we have logins on GitHub in the first place).
2Except index and HEAD
as already enumerated, and also certain work-tree-specific special refs, such as ORIG_HEAD
, MERGE_HEAD
, CHERRY_PICK_HEAD
, and so on. As I write this, it occurs to me that I'm not sure if FETCH_HEAD
is work-tree-specific, or not. Special refs like that for bisection are also work-tree-specific. The details here are messy and rather ad-hoc; some details were missed in the original implementation.
Upvotes: 16
Reputation: 119
Creating a working tree is a way of distinguishing the commits tracked in your branches and is useful for hotfixes. When you create a working tree, a new branch will be created automatically. The benefit is that your current version of the application will still be on you local environment version and you will not have to revert dev to previous version to push changes to.
For example, you are in process of migrating the application from MySQL to a Postgres, but the changes are not ready for production. A bug is reported that needs to be patched asap. Your dev environment no longer has a running MySql Instance. If you create a new working tree and patch the application only the changes/commits in that tree will be pushed to Main, and not the changes in Database Providers.
Upvotes: 3