Lutz Prechelt

Reputation: 39336

A commit in Git: Is it a snapshot/state/image or is it a change/diff/patch/delta?

When learning git, this is very confusing.

So what is an appropriate mental model for a Git commit?

Upvotes: 23

Views: 5678

Answers (5)

Lutz Prechelt

Reputation: 39336

So which answer is correct?

If looking at these answers does not help resolve your confusion about git commits, this is because my original question was not formulated well: it asked "What is a git commit?" instead of what I really meant to learn: "How should I think about git commits?".

As a result, the answers use different perspectives. So which of them are correct?

The "git commit model dualities" answer?

This answer is correct for the updated version of the question.

It talks about how you need to apply different mental models for what is a git commit, depending on which git command you are currently thinking about.

If you want to understand how to use git, you will definitely need to have this understanding.

The "Commits are snapshots, not diffs" answer?

This answer is appropriate for the original version of the question and less so for the updated (and intended) version.

It talks about the technical representation of commits.

If you only want to understand how to use git, this knowledge may or may not be helpful for you:

  • It contains a lot of detail beyond the first (and unavoidable) answer, so it may get in the way.
  • However, if you best understand software by understanding how it works internally, it may be a good starting point.

If you are not keen on learning internals, the dualities answer is fine initially, but be aware that in order to become a Git power user, you will need to learn about the internals eventually; they shine through frequently in the git documentation and many other git explanations.

One of the others?

  • timgeb's answer is a suuuuper short way of talking about both views.
  • Kaz's answer also mentions two concepts (parents and index) that pertain to the question only remotely and may therefore be a bit difficult to understand.

Upvotes: 0

Lutz Prechelt

Reputation: 39336

Understand the Git commit model dualities

Short answer: both.

Medium answer: It depends.

Long answer: Git is a bit like quantum phenomena: Neither of the two views alone can explain all observations. Read on.

Internally, Git will use both representations, depending (conceptually) on which one it deems more efficient in terms of storage space and execution time for a given commit at a certain time. The snapshot representation is the primary one.

From the user's point of view, however, it depends on what you do:

Duality 1: Commit as a snapshot vs. commit as a change

Some commands only make sense when you think of commits as snapshots of the working tree. This is most pronounced for checkout, but is also true for stash and at least halfway for fetch and reset.

For other commands, madness is the likely result when you try to think of commits in this manner. For those other commands, commits are clearly treated as changes,

  • either in the form of patches you can look at (e.g. show, diff)
  • or in the form of operators you can apply to modify your working tree (e.g. apply, cherry-pick, pull)
  • or in the form of operators you can apply to modify other commits (e.g. rebase)
  • or in the form of operators you can apply to create new commits (e.g. merge, cherry-pick)
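To make the "commit as an operator" view concrete, here is a minimal Python sketch. It is a toy model, not Git's implementation: a snapshot is assumed to be a plain dict from path to content, and the helper names `diff` and `apply_change` are invented for this illustration. Real Git performs a three-way merge and can report conflicts, which this sketch ignores.

```python
# Toy model: a snapshot is a dict mapping path -> file content.
# Illustration only; Git's real cherry-pick does a three-way merge.

def diff(old, new):
    """Return the change that turns snapshot `old` into snapshot `new`."""
    change = {}
    for path in old.keys() | new.keys():
        if old.get(path) != new.get(path):
            change[path] = new.get(path)  # None means "delete this path"
    return change

def apply_change(snapshot, change):
    """Apply a change to any snapshot and return the resulting snapshot."""
    result = dict(snapshot)
    for path, content in change.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result

parent = {"README": "v1", "main.py": "print(1)"}
commit = {"README": "v1", "main.py": "print(2)"}  # the commit, stored as a snapshot
other_branch = {"README": "v2", "main.py": "print(1)", "LICENSE": "MIT"}

change = diff(parent, commit)                # the same commit, viewed as a change
picked = apply_change(other_branch, change)  # "cherry-pick" onto another snapshot
print(change)
print(picked)
```

The stored data is only ever snapshots; the change is derived on demand and can then be applied somewhere else.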

Duality 2: Commit as a fixed thing vs. commit as something fluid

There is a side-effect of duality 1 that can shock Git newbies accustomed to other versioning systems. It is the fact that Git appears to not even commit itself to its commits.

Huh?

Assume you have created a branch X containing what you like to think of as your commits A and B. But master has progressed a little, so you rebase X to master.

When you think of A and B as changes, but of master as a snapshot (hey, both commit models occur in a single operation!), this is not a problem: Just apply the changes A and B to the snapshot master.

This thinking is so natural that you will barely notice that Git has now rewritten your commits A and B: They now have different snapshot content and hence a different SHA-1 ID. In Git, the conceptual commit that you think of as a developer is not a fixed-for-all-times kind of thing, but rather some fluid object that changes as a result of working with your repository.

In contrast, if you think of all three (A, B, and master) as snapshots or of all three as changes, your brain will hurt and you will get nowhere.
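The rebase scenario can be sketched in a few lines of Python. This is a toy model under stated assumptions: the ID below hashes only the parent ID, the message, and the snapshot, whereas real Git IDs also cover author, committer, and timestamps; the helper names (`make_commit`, `change_of`, `replay`) are made up for this example.

```python
import hashlib

# Toy model, not Git's object format: an ID is the SHA-1 of the parent ID,
# the message, and the sorted snapshot contents.
def commit_id(parent_id, message, snapshot):
    payload = repr((parent_id, message, sorted(snapshot.items())))
    return hashlib.sha1(payload.encode()).hexdigest()

def make_commit(parent, message, snapshot):
    parent_id = parent["id"] if parent else None
    return {"id": commit_id(parent_id, message, snapshot),
            "parent": parent, "message": message, "snapshot": snapshot}

def change_of(commit):
    """View a commit as a change: paths that differ from the parent snapshot."""
    before = commit["parent"]["snapshot"]
    after = commit["snapshot"]
    return {p: after.get(p) for p in before.keys() | after.keys()
            if before.get(p) != after.get(p)}

def replay(commit, new_base):
    """Re-create `commit` on top of `new_base` by applying its change."""
    snapshot = dict(new_base["snapshot"])
    for path, content in change_of(commit).items():
        if content is None:
            snapshot.pop(path, None)
        else:
            snapshot[path] = content
    return make_commit(new_base, commit["message"], snapshot)

base = make_commit(None, "base", {"a.txt": "1"})
a = make_commit(base, "A", {"a.txt": "1", "x.txt": "A"})
b = make_commit(a, "B", {"a.txt": "1", "x.txt": "A", "y.txt": "B"})
master = make_commit(base, "master work", {"a.txt": "2"})

a2 = replay(a, master)  # rebase X onto master: replay A, then B
b2 = replay(b, a2)

print(change_of(a) == change_of(a2))  # same change ...
print(a["id"] == a2["id"])            # ... but a different commit
print(b2["snapshot"])
```

The change carried by A survives the rebase unchanged, while its snapshot, its parent, and therefore its ID are all new.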

Disclaimer

The above is a much-simplified description. In Git reality,

  • a commit is not a snapshot at all, it is a piece of metadata (the who/when/why of a snapshot) plus a pointer to a snapshot;
  • the snapshot is called a tree in Git lingo;
  • the commits-as-changes internal representation uses packfiles;
  • some of the above-mentioned commands have further roles that do not fit the same characterization;
  • and even for the given roles it is to some degree a matter of taste into which category (or -ies) certain commands belong.

And don't get confused by the fact that the Pro Git book's very first characterization of Git (in section "Git Basics") is "Snapshots, Not Differences".

Git is complicated after all.

Upvotes: 12

timgeb

Reputation: 78690

The answers here are too long.

  1. A Commit is a small metadata file. It contains all the information to restore a full snapshot of your project.
  2. Some commands, such as cherry-pick, compute differences between snapshots.

Upvotes: 2

VonC

Reputation: 1324447

While it could be construed as both, the GitHub Engineering team is clear (Dec. 2020):

Commits are snapshots, not diffs

Derrick Stolee starts with

  • Object ID
  • blobs (file content)
  • tree (directory listing)
  • commits: snapshots!

Object ID

The most important part to know about Git objects is that Git references each by its object ID (OID for short), providing a unique name for the object.
We will use the git rev-parse <ref> command to discover these OIDs.
Each object is essentially a plain-text file and we can examine its contents using the git cat-file -p <oid> command.

Blobs (file content)

To discover the OID for a file at your current revision, run git rev-parse HEAD:<path>.
Then, use git cat-file -p <oid> to find its contents.
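As a small illustration of where a blob's OID comes from, the following Python sketch recomputes it without calling Git. It assumes a repository using the default SHA-1 object format; the function name `blob_oid` is chosen for this example.

```python
import hashlib

def blob_oid(content: bytes) -> str:
    # Git hashes a short header ("blob", a space, the size, a NUL byte)
    # followed by the raw file content.
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

print(blob_oid(b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(blob_oid(b""))               # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The OID depends only on the content, so two files with identical content share one blob regardless of their names.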

Trees (directory listings)

Note that blobs contain file contents, but not the file names!
The names come from Git’s representation of directories: trees.
A tree is an ordered list of path entries, paired with object types, file modes, and the OID for the object at that path.
Subdirectories are also represented as trees, so trees can point to other trees!
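The tree format can be sketched the same way. This is a simplified Python illustration assuming SHA-1 object IDs; it sorts entries by plain name, whereas Git sorts directory names as if they ended in "/". The helper names `object_oid` and `tree_oid` are invented here.

```python
import hashlib

def object_oid(kind: bytes, body: bytes) -> str:
    # Every object is hashed as "<kind> <size>" + NUL + body.
    header = kind + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_oid(entries):
    """entries: list of (mode, name, oid_hex), e.g. ("100644", "README", "...")."""
    body = b""
    # Simplification: plain name order (see the note about directories above).
    for mode, name, oid_hex in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid_hex)
    return object_oid(b"tree", body)

readme = object_oid(b"blob", b"hello world\n")
before = tree_oid([("100644", "README", readme)])
after = tree_oid([("100644", "README.md", readme)])  # same blob, new name

print(tree_oid([]))     # 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
print(before != after)  # True: a rename changes the tree, not the blob
```

Because names live in the tree, renaming a file produces a new tree OID while the blob OID stays the same.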

Finally:

commit: snapshot in time

A commit is a snapshot in time. Each commit contains a pointer to its root tree, representing the state of the working directory at that time.
The commit has a list of parent commits corresponding to the previous snapshots.
A commit with no parents is a root commit and a commit with multiple parents is a merge commit.
Commits also contain metadata describing the snapshot such as author and committer (including name, email address, and date) and a commit message.
The commit message is an opportunity for the commit author to describe the purpose of that commit with respect to the parents.
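A commit object can be sketched along the same lines. The Python below builds the plain-text body (tree, parents, author, committer, blank line, message) and hashes it, assuming SHA-1 object IDs and ignoring optional headers such as signatures. The identity string and the function name `commit_oid` are hypothetical values for this example.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

def commit_oid(tree, parents, author, committer, message):
    # Body layout: one "tree" line, zero or more "parent" lines,
    # author and committer lines, a blank line, then the message.
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

who = "A U Thor <author@example.com> 1600000000 +0000"  # made-up identity
root = commit_oid(EMPTY_TREE, [], who, who, "initial\n")
child = commit_oid(EMPTY_TREE, [root], who, who, "initial\n")

print(root)
print(child)
print(root != child)  # same tree and message, different parent list
```

Note that the commit body contains no file content and no diff: only a pointer to the root tree, the parent pointers, and metadata.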

https://github.blog/wp-content/uploads/2020/12/commit.png?resize=399%2C268?w=399

Even though commits are snapshots, we frequently look at a commit in a history view or on GitHub as a diff. In fact, the commit message frequently refers to this diff.

The diff is dynamically generated from the snapshot data by comparing the root trees of the commit and its parent. Git can compare any two snapshots in time, not just adjacent commits.
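A rough Python sketch of such an on-demand comparison follows. It assumes each snapshot has been flattened to a dict from path to blob OID (the short OIDs are placeholders) and only classifies paths; Git's real diff walks trees recursively and also produces line-level hunks.

```python
def diff_snapshots(old, new):
    """old/new: dicts mapping path -> blob OID (a flattened root tree)."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "deleted": deleted, "modified": modified}

# Three snapshots in a row; the OIDs are shortened placeholders.
c1 = {"README": "a1", "main.py": "b1"}
c2 = {"README": "a1", "main.py": "b2"}
c3 = {"README": "a2", "main.py": "b2", "LICENSE": "c1"}

print(diff_snapshots(c2, c3))  # adjacent commits
print(diff_snapshots(c1, c3))  # any two snapshots work equally well
```

Nothing about the comparison requires the two snapshots to be parent and child.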

Computing diffs is what enables git cherry-pick or git rebase.

And since commits are not diffs...

Git doesn’t track renames. There is no data structure inside Git that stores a record that a rename happened between a commit and its parent.
Instead, Git tries to detect renames during the dynamic diff calculation. There are two stages to this rename detection: exact renames and edit-renames.

After first computing a diff, Git inspects the internal model of that diff to discover which paths were added or deleted.
Naturally, a file that was moved from one location to another would appear as a deletion from the first location and an add in the second. Git attempts to match these adds and deletes to create a set of inferred renames.
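The exact-rename stage can be approximated in Python as below. This is a sketch, not Git's algorithm: snapshots are assumed to be dicts from path to blob OID (placeholder values), a deleted and an added path are paired when their OIDs are identical, and the similarity-based edit-rename stage is left out entirely.

```python
def detect_exact_renames(old, new):
    """Pair deleted and added paths that point to the same blob OID."""
    deleted = {p: old[p] for p in old.keys() - new.keys()}
    added = {p: new[p] for p in new.keys() - old.keys()}

    # Index the deleted paths by OID so each added path can look up a match.
    by_oid = {}
    for path in sorted(deleted):
        by_oid.setdefault(deleted[path], []).append(path)

    renames = []
    for path in sorted(added):
        candidates = by_oid.get(added[path])
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames

old = {"docs/guide.txt": "aaa", "main.py": "bbb"}
new = {"docs/manual.txt": "aaa", "main.py": "bbb", "new.txt": "ccc"}
print(detect_exact_renames(old, new))
```

The rename is inferred purely from the two snapshots; no rename record is read from anywhere.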

Upvotes: 11

Kaz

Reputation: 58578

A commit is a snapshot state. When you do git diff, it calculates the diff to the parent. This is why there can be multiple parents (the case when there is a merge). Internally, there is delta compression going on, but the versioning model isn't patch-based.

A central concept in git is the index (also called the staging area). It records the tracked paths together with the content staged for each of them. Changes are staged when they propagate from the working copy to the index; this puts the index into a modified state. The commit operation turns that state into a new commit.
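The working copy, index, and commit flow can be pictured with a toy Python model. It assumes all three are plain dicts from path to content, which is a simplification: the real index stores blob OIDs and file metadata, and the function names `stage` and `commit` are invented for this sketch.

```python
# Toy model: working copy, index, and each commit are dicts of path -> content.
working_copy = {"a.txt": "edited", "b.txt": "new file"}
index = {"a.txt": "original"}  # matches the last commit
commits = [dict(index)]

def stage(path):
    """Like `git add <path>`: copy the working-copy content into the index."""
    index[path] = working_copy[path]

def commit():
    """Like `git commit`: freeze the current index as a new snapshot."""
    commits.append(dict(index))

stage("a.txt")  # b.txt is deliberately left unstaged
commit()
print(commits[-1])
```

The new commit reflects the index, not the working copy, which is why the unstaged file is absent from it.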

Upvotes: 4
