Niklas R
Niklas R

Reputation: 16900

Follow Git history via commit parents in Script visits same commit twice

I am trying to write a script that performs a check on every commit, and for that check I need to know the parents of the commit. After the check, I follow the same procedure with the parent commits.

My problem is that I encounter the same commit multiple times – so unless I have a cycle in my repository, I probably do something wrong.

import subprocess

def parents(rev):
  args = ['git', 'rev-list', '--parents', '-n', '1', rev]
  output = subprocess.check_output(args, stderr=subprocess.PIPE).decode()
  items = output.split()
  return items[1:]  # First SHA is the ID of the revision that we passed into the command

revisions = parents('HEAD')
visited = set()
while revisions:
  rev = revisions.pop()
  assert rev not in visited, rev
  visited.add(rev)
  print(rev)  # TODO: Do check on commit
  revisions += parents(rev)

I would expect this to print somethig similar to git rev-list HEAD, but the assertion fires after a while.

Why do I encounter the same commit twice with this method? Is my assumption incorrect that following the parents of a commit allows me to traverse the full history?

Upvotes: 2

Views: 146

Answers (2)

VonC
VonC

Reputation: 1329632

Note: with Git 2.22 (Q2 2019), git rev-list --parents will still visit the same commit multiple times, but will do so faster, because of a performance fix for "rev-list --parents -- pathspec".

See commit 8320b1d (04 Apr 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d9d65e9, 25 Apr 2019)

revision: use a prio_queue to hold rewritten parents

This patch fixes a quadratic list insertion in rewrite_one() when pathspec limiting is combined with --parents.

What happens is something like this:

  1. We see that some commit X touches the path, so we try to rewrite its parents.
  2. rewrite_one() loops forever, rewriting parents, until it finds a relevant parent (or hits the root and decides there are none). The heavy lifting is done by process_parent(), which uses try_to_simplify_commit() to drop parents.
  3. process_parent() puts any intermediate parents into the &revs->commits list, inserting by commit date as usual.

So if commit X is recent, and then there's a large chunk of history that doesn't touch the path, we may add a lot of commits to &revs->commits.
And insertion by commit date is O(n) in the worst case, making the whole thing quadratic.

We tried to deal with this long ago in fce87ae (Fix quadratic performance in rewrite_one., 2008-07-12, v1.5.6.6).
In that scheme, we cache the oldest commit in the list; if the new commit to be added is older, we can start our linear traversal there. This often works well in practice because parents are older than their descendants, and thus we tend to add older and older commits as we traverse.

But this isn't guaranteed, and in fact there's a simple case where it is not: merges.
Imagine we look at the first parent of a merge and see a very old commit (let's say 3 years old). And on the second parent, as we go back 3 years in history, we might have many commits. That one first-parent commit has polluted our oldest-commit cache; it will remain the oldest while we traverse a huge chunk of history, during which we have to fall back to the slow, linear method of adding to the list.

Naively, one might imagine that instead of caching the oldest commit, we'd start at the last-added one. But that just makes some cases faster while making others slower (and indeed, while it made a real-world test case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still quadratic, and some cases will approach that.

Instead, let's use a data structure with better worst-case performance.
Swapping out revs->commits for something else would have repercussions all over the code base, but we can take advantage of one fact: for the rewrite_one() case, nobody actually needs to see those commits in revs->commits until we've finished generating the whole list.

That leaves us with two obvious options:

  1. We can generate the list unordered, which should be O(n), and then sort it afterwards, which would be O(n log n) total. This is "sort-after" below.

  2. We can insert the commits into a separate data structure, like a priority queue. This is "prio-queue" below.

I expected that sort-after would be the fastest (since it saves us the extra step of copying the items into the linked list), but surprisingly the prio-queue seems to be a bit faster.

Here are timings for the new p0001.6 for all three techniques across a few repositories, as compared to master:

master              cache-last                sort-after              prio-queue
--------------------------------------------------------------------------------------------
GIT_PERF_REPO=git.git
0.52(0.50+0.02)      0.53(0.51+0.02)  +1.9%   0.37(0.33+0.03) -28.8%  0.37(0.32+0.04) -28.8%

GIT_PERF_REPO=linux.git
20.81(20.74+0.07)   20.31(20.24+0.07) -2.4%   0.94(0.86+0.07) -95.5%  0.91(0.82+0.09) -95.6%

GIT_PERF_REPO=llvm-project.git
83.67(83.57+0.09)    4.23(4.15+0.08) -94.9%   3.21(3.15+0.06) -96.2%  2.98(2.91+0.07) -96.4%

Upvotes: 0

larsks
larsks

Reputation: 312770

The behavior you are seeing is intrinsic to the git rev-list --parents command. Consider a repository that looks like this:

A--B--C
 \   /
   D

The output of git log --oneline might be:

0000004 (HEAD -> master) Merge branch "mybranch"
0000003 B
0000002 D
0000001 A

But commit A is the parent of both B and D. So for B:

$ git rev-list --parents -n1 B
0000003 0000001

And for D:

$ git rev-list --parents -n1 D
0000002 0000001

You see commit A listed twice, which is exactly what is triggering the issue in your question.

Depending on what you're trying to do, the easiest solution might be to iterate over the output of git rev-list HEAD, which will only list a commit once.

Upvotes: 2

Related Questions