Reputation: 16900
I am trying to write a script that performs a check on every commit, and for that check I need to know the parents of the commit. After the check, I follow the same procedure with the parent commits.
My problem is that I encounter the same commit multiple times – so unless I have a cycle in my repository, I probably do something wrong.
import subprocess
def parents(rev):
args = ['git', 'rev-list', '--parents', '-n', '1', rev]
output = subprocess.check_output(args, stderr=subprocess.PIPE).decode()
items = output.split()
return items[1:] # First SHA is the ID of the revision that we passed into the command
revisions = parents('HEAD')
visited = set()
while revisions:
rev = revisions.pop()
assert rev not in visited, rev
visited.add(rev)
print(rev) # TODO: Do check on commit
revisions += parents(rev)
I would expect this to print somethig similar to git rev-list HEAD
, but the assertion fires after a while.
Why do I encounter the same commit twice with this method? Is my assumption incorrect that following the parents of a commit allows me to traverse the full history?
Upvotes: 2
Views: 146
Reputation: 1329632
Note: with Git 2.22 (Q2 2019), git rev-list --parents
will still visit the same commit multiple times, but will do so faster, because of a performance fix for "rev-list --parents -- pathspec
".
See commit 8320b1d (04 Apr 2019) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
-- in commit d9d65e9, 25 Apr 2019)
revision
: use aprio_queue
to hold rewritten parentsThis patch fixes a quadratic list insertion in
rewrite_one()
when pathspec limiting is combined with--parents
.What happens is something like this:
- We see that some
commit X
touches the path, so we try to rewrite its parents.rewrite_one()
loops forever, rewriting parents, until it finds a relevant parent (or hits the root and decides there are none). The heavy lifting is done byprocess_parent()
, which usestry_to_simplify_commit()
to drop parents.process_parent()
puts any intermediate parents into the&revs->commits
list, inserting by commit date as usual.So if
commit X
is recent, and then there's a large chunk of history that doesn't touch the path, we may add a lot of commits to&revs->commits
.
And insertion by commit date isO(n)
in the worst case, making the whole thing quadratic.We tried to deal with this long ago in fce87ae (Fix quadratic performance in rewrite_one., 2008-07-12, v1.5.6.6).
In that scheme, we cache the oldest commit in the list; if the new commit to be added is older, we can start our linear traversal there. This often works well in practice because parents are older than their descendants, and thus we tend to add older and older commits as we traverse.But this isn't guaranteed, and in fact there's a simple case where it is not: merges.
Imagine we look at the first parent of a merge and see a very old commit (let's say 3 years old). And on the second parent, as we go back 3 years in history, we might have many commits. That one first-parent commit has polluted our oldest-commit cache; it will remain the oldest while we traverse a huge chunk of history, during which we have to fall back to the slow, linear method of adding to the list.Naively, one might imagine that instead of caching the oldest commit, we'd start at the last-added one. But that just makes some cases faster while making others slower (and indeed, while it made a real-world test case much faster, it does quite poorly in the perf test include here).
Fundamentally, these are just heuristics; our worst case is still quadratic, and some cases will approach that.Instead, let's use a data structure with better worst-case performance.
Swapping outrevs->commits
for something else would have repercussions all over the code base, but we can take advantage of one fact: for therewrite_one()
case, nobody actually needs to see those commits inrevs->commits
until we've finished generating the whole list.That leaves us with two obvious options:
We can generate the list unordered, which should be O(n), and then sort it afterwards, which would be
O(n log n)
total. This is "sort-after
" below.We can insert the commits into a separate data structure, like a priority queue. This is "
prio-queue
" below.I expected that
sort-after
would be the fastest (since it saves us the extra step of copying the items into the linked list), but surprisingly theprio-queue
seems to be a bit faster.Here are timings for the new
p0001.6
for all three techniques across a few repositories, as compared tomaster
:master cache-last sort-after prio-queue -------------------------------------------------------------------------------------------- GIT_PERF_REPO=git.git 0.52(0.50+0.02) 0.53(0.51+0.02) +1.9% 0.37(0.33+0.03) -28.8% 0.37(0.32+0.04) -28.8% GIT_PERF_REPO=linux.git 20.81(20.74+0.07) 20.31(20.24+0.07) -2.4% 0.94(0.86+0.07) -95.5% 0.91(0.82+0.09) -95.6% GIT_PERF_REPO=llvm-project.git 83.67(83.57+0.09) 4.23(4.15+0.08) -94.9% 3.21(3.15+0.06) -96.2% 2.98(2.91+0.07) -96.4%
Upvotes: 0
Reputation: 312770
The behavior you are seeing is intrinsic to the git rev-list --parents
command. Consider a repository that looks like this:
A--B--C
\ /
D
The output of git log --oneline
might be:
0000004 (HEAD -> master) Merge branch "mybranch"
0000003 B
0000002 D
0000001 A
But commit A
is the parent of both B
and D
. So for B
:
$ git rev-list --parents -n1 B
0000003 0000001
And for D
:
$ git rev-list --parents -n1 D
0000002 0000001
You see commit A
listed twice, which is exactly what is triggering the issue in your question.
Depending on what you're trying to do, the easiest solution might be to iterate over the output of git rev-list HEAD
, which will only list a commit once.
Upvotes: 2