Reputation: 1937

git finding duplicate commits (by patch-id)

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

This seems to be an intended use of patch-id:

git patch-id --help

IOW, you can use this thing to look for likely duplicate commits.

I imagine that stringing together "git log", "git patch-id" and uniq could do the job badly, but if someone has a command that does the job well, I'd appreciate it.

Upvotes: 14

Answers (8)

Guildenstern

Reputation: 3771

git-cherry(1) is the tool for finding duplicate commits (applied with git-rebase(1), git-cherry-pick(1) or something else) in some other branch, as mentioned in robinst’s answer.

git-patch-id(1) is nice for building a map of commits/patch-ids if you want to do more lookups, for example in a program which finds duplicate commits across multiple branches. [1]

But note that git-patch-id(1) can take however many commits with diffs as you want. You don’t have to feed it only one such thing. In other words you don’t need this (pseudo code): [2]

for each r in rev-list <commit>:
    git diff-tree --patch on <r> | git patch-id

You can do this instead: [3]

git rev-list --no-merges HEAD \
    | git diff-tree --patch --stdin \
    | git patch-id --stable

The above takes 11.115s to process 17,039 commits (with redirection to file).

This one below takes 2m23s (with redirection to file).

for r in $(git rev-list --no-merges HEAD); do
    git diff-tree --patch $r | git patch-id --stable
done
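The batched pipeline above prints one "<patch-id> <commit>" pair per line, so finding duplicates becomes a grouping step. Here is a minimal sketch of that step, assuming whitespace-separated two-column input. The function name group_duplicate_patches is made up for illustration and is not part of Git:

```shell
# Read "<patch-id> <commit>" lines on stdin (the format git patch-id prints)
# and print one line per patch-id that is shared by more than one commit.
group_duplicate_patches() {
    awk '
        {
            # Append this commit to the list kept for its patch-id.
            if ($1 in commits) commits[$1] = commits[$1] " " $2
            else commits[$1] = $2
            count[$1]++
        }
        END {
            # Only patch-ids seen at least twice are duplicates.
            for (p in count)
                if (count[p] > 1) print p ": " commits[p]
        }
    ' | sort
}

# Hypothetical usage inside a repository:
#   git rev-list --no-merges HEAD \
#       | git diff-tree --patch --stdin \
#       | git patch-id --stable \
#       | group_duplicate_patches
```

Splitting on fields rather than on fixed character columns means the sketch does not depend on the length of the hashes.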

Notes

git-cherry(1) only reports which of your commits are applied or not—it doesn’t report what other commit it matches. So git-patch-id(1) provides more information if you need full reporting.
git-diff-tree(1) seems better than git-show(1) since you don’t need the commit message/metadata that git-show(1) gives you. git-patch-id(1) can deal with it just fine, but it’s just more stuff for it to skip through. The git-patch-id(1) man page also mentions “git diff-tree output”. [4]

Note that this is the output of the git-diff-tree(1) invocation we are using (both the commit hash as well as the diff):
```
723a8c578297e1dcc6a319576af750971391f799
diff --git <file>
```
--stable here since it seems more robust (normalization)
git version 2.47.0

Upvotes: 1

VonC

Reputation: 1324208

Make sure to use a recent version of Git (2.39 or more)

The git log --format=%H mentioned in the answer by bsb (the OP) is not always unique.

That is because, before Git 2.29 (Q4 2020), the patch-id computation did not ignore the "incomplete last line" marker the way it ignores whitespace.

See commit 82a6201 (19 Aug 2020) by René Scharfe (rscharfe).
(Merged by Junio C Hamano -- gitster -- in commit 5122614, 24 Aug 2020)

patch-id: ignore newline at end of file in diff_flush_patch_id()

Reported-by: Tilman Vogel
Initial-test-by: Tilman Vogel
Signed-off-by: René Scharfe

Whitespace is ignored when calculating patch IDs.
This is done by removing all whitespace from diff lines before hashing them, including a newline at the end of a file.
If that newline is missing, however, diff reports that fact in a separate line containing "\ No newline at end of file\n", and this marker is hashed like a context line.

This goes against our goal of making patch IDs independent of whitespace.

Use the same heuristic that 2485eab55cc (git-patch-id: do not trip over "no newline" markers, 2011-02-17) added to git patch-id instead and skip diff lines that start with a backslash and a space and are longer than twelve characters.

A "patch ID" is nothing but a SHA-1 of the diff associated with a patch, with whitespace and line numbers ignored

Actually, git patch-id evolved again with Git 2.39 (Q4 2022).

A new "--include-whitespace" option is added to "git patch-id", and existing bugs in the internal patch-id logic that did not match what "git patch-id" produces have been corrected with Git 2.39 (Q4 2022).

See commit 0d32ae8, commit 2871f4d, commit 93105ab, commit 0df19eb, commit 51276c1, commit 0570be7 (24 Oct 2022) by Jerry Zhang (jerry-skydio).
(Merged by Taylor Blau -- ttaylorr -- in commit 160314e, 30 Oct 2022)

builtin: patch-id: add --verbatim as a command mode

Signed-off-by: Jerry Zhang
Signed-off-by: Junio C Hamano

There are situations where the user might not want the default setting where patch-id strips all whitespace.
They might be working in a language where white space is syntactically important, or they might have CI testing that enforces strict whitespace linting.
In these cases, a whitespace change would result in the patch fundamentally changing, and thus deserving of a different id.

Add a new mode that is exclusive of --stable and --unstable called --verbatim.
It also corresponds to the config patchid.verbatim = true.
In this mode, the stable algorithm is used and whitespace is not stripped from the patch text.

Users of --unstable mainly care about compatibility with old versions, which unstripping the whitespace would break.
Thus there isn't a use case for the combination of --verbatim and --unstable, and we don't expose this so as to not add maintenance burden.

fixes https://github.com/Skydio/revup/issues/2

git patch-id now includes in its man page:

--verbatim

Calculate the patch-id of the input as it is given, do not strip any whitespace.

This is the default if patchid.verbatim is true.
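To see the effect of --verbatim, one could compare two diffs that differ only in whitespace. This is a sketch assuming Git 2.39 or later. The diff content and the helper names make_diff and patch_id_of are invented for the example:

```shell
# Build a small one-hunk diff; the added line is passed in as $1.
make_diff() {
    printf '%s\n' \
        'diff --git a/f.txt b/f.txt' \
        '--- a/f.txt' \
        '+++ b/f.txt' \
        '@@ -1 +1 @@' \
        '-old' \
        "+$1"
}

# Print only the patch-id (first field) for a diff with the given added line.
# Any further arguments are passed on to git patch-id.
patch_id_of() {
    added_line=$1
    shift
    make_diff "$added_line" | git patch-id "$@" | cut -d' ' -f1
}

# Hypothetical usage:
#   patch_id_of 'new line'               # default mode
#   patch_id_of 'new  line'              # same change, extra space
#   patch_id_of 'new line'  --verbatim   # whitespace kept
#   patch_id_of 'new  line' --verbatim
```

Without --verbatim the first two IDs should be identical, since whitespace is stripped before hashing. With --verbatim the last two should differ.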

But that is not all.
From the OP:

I'd like a recipe for finding duplicated changes. patch-id is likely to be the same but the commit attributes may not be.

That is also fixed with Git 2.39:

patch-id: fix patch-id for mode changes

Signed-off-by: Jerry Zhang

Currently patch-id as used in rebase and cherry-pick does not account for file modes if the file is modified.
One consequence of this is that if you have a local patch that changes modes, but upstream has applied an outdated version of the patch that doesn't include that mode change, "git rebase" will drop your local version of the patch along with your mode changes.
It also means that internal patch-id doesn't produce the same output as the builtin, which does account for mode changes due to them being part of diff output.

Fix by adding mode to the patch-id if it has changed, in the same format that would be produced by diff, so that it is compatible with builtin patch-id.

And one last difference which was not properly detected/reported:

builtin: patch-id: fix patch-id with binary diffs

Signed-off-by: Jerry Zhang

"git patch-id" currently does not produce correct output if the incoming diff has any binary files.
Add logic to get_one_patchid to handle the different possible styles of binary diff.
This attempts to keep resulting patch-ids identical to what would be produced by the counterpart logic in diff.c, that is it produces the id by hashing the a and b oids in succession.

In general we handle binary diffs by first caching the object ids from the "index" line and using those if we then find an indication that the diff is binary.

The input could contain patches generated with "git diff --binary".
This currently breaks the parse logic and results in multiple patch-ids output for a single commit.
Here we have to skip the contents of the patch itself since those do not go into the patch id.
--binary implies --full-index so the object ids are always available.

When the diff is generated with --full-index there is no patch content to skip over.

When a diff is generated without --full-index or --binary, it will contain abbreviated object ids.
This will still result in a sufficiently unique patch-id when hashed, but does not match internal patch id output.
We'll call this OK for now as we already need specialized arguments to diff in order to match internal patch id (namely -U3).

Upvotes: 1

James Close

Reputation: 932

For anyone wanting to do this in Windows PowerShell, the equivalent of the command in unagi's answer is:

git rev-list --no-merges --all  | %{&git.exe show $_} | 
  git patch-id | ConvertFrom-String -PropertyNames PatchId, Commit | 
  Group-Object PatchId | Where-Object count -gt 1 | 
  %{$_.group.Commit + " "}

Gives an output like:

1605e0e1e13d7b3f456c20432d8edec664ca7117
1e8efa8f2f01962a2c08fd25caf687d330383428

b45b6db084b27ae420ac8e9cf6511110ebb46513
4a2e1e3ba5a9a1d5db1d00343813e1404f6124e2

With the duplicate commit hashes grouped together.

CAUTION: On my repo this was a slow command so be sure to filter the call to rev-list appropriately!

Upvotes: 0

unagi

Reputation: 466

To search for duplicate commits of commit $hash, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | grep ^$(git show "$hash"|git patch-id|cut -c1-40) | cut -c42-81 \
    | xargs -r git show -s --oneline

To search for duplicates of a merge commit $mergehash, replace $(git show $hash|git patch-id|cut -c1-40) above by one of the two patch IDs (1st column) given by git diff-tree -m -p $mergehash | git patch-id. They correspond to the diffs of the merge commit with each of its two parents.

To find duplicates of all commits, excluding merge commits:

git rev-list --no-merges --all | xargs -r git show | git patch-id \
    | sort | uniq -w40 -D | cut -c42-81 \
    | xargs -r git log --no-walk --pretty=format:"%h %ad %an (%cn) %s" --date-order --date=iso

The search for duplicate commits can be extended or limited by changing the arguments to git rev-list, which accepts numerous options. For example, to limit the search to a specific branch specify its name instead of the option --all; or to search in the last 100 commits pass the arguments HEAD ^HEAD~100.

Note that these commands are fast since they use no shell loop, and batch-process commits.

To include merge commits, remove the option --no-merges, and replace xargs -r git show by xargs -r -L1 git diff-tree -m -p. This is much slower because git diff-tree is executed once per commit.

Explanation:

The first line generates a map of the patch IDs with the commit hashes (2-column data, of 40 characters each).
The second line only keeps commit hashes (2nd column) corresponding to the duplicate patch IDs (1st column).
The last line prints custom information about the duplicate commits.

Upvotes: 3

Gidfiddle

Reputation: 121

The nifty command suggested by bsb requires a couple of small tweaks:

(1) Instead of git show, which runs git diff-tree --cc, the command should use

    git diff-tree -p

Otherwise git patch-id generates spurious null SHA1 hashes.

(2) When the pipe to xargs is used, xargs should have the -L 1 argument. Otherwise a triplicated commit will not be paired with an equivalent commit.

Here's an alias to go in ~/.gitconfig:

dup = "!f() { for c in $(git rev-list HEAD); do git diff-tree -p $c | git patch-id; done | perl -anle '($p,$c)=@F;print \"$c $s{$p}\" if $s{$p};$s{$p}=$c' | xargs -L 1 git show -s --oneline; }; f" # "git dup" lists duplicate commits

Upvotes: 2

Slipp D. Thompson

Reputation: 34913

For looking for duplicates of a specific commit, this may work for you.

First, determine the patch id of the target commit:

$ THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF='7a3e67c'
$ git show $THE_COMMIT_REF_OR_SHA_YOURE_SEEKING_DUPES_OF | git patch-id
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3

The first SHA is the patch-id. Next, list the patch ids for every commit and keep only those that match:

$ for c in $(git rev-list --all); do git show $c | git patch-id; done | grep 'f6ea51cd6acd30cd627ce1a56e2733c1777d5b52'
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 5028e2b5500bd5f4637531e337e17b73f5d0c0b1
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 7a3e67ce38dbef471889d9f706b9161da7dc5cf3
f6ea51cd6acd30cd627ce1a56e2733c1777d5b52 929c66b5783a0127a7689020d70d398f095b9e00

All together, with a few extra flags, and in the form of a utility script:

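#!/bin/sh
# List every commit whose patch-id matches that of the target commit (default: HEAD).
TARGET_COMMIT_SHA="${1:-HEAD}"

# The patch-id is the first field of git patch-id's output.
TARGET_COMMIT_PATCHID=$(
git show --patch-with-raw "$TARGET_COMMIT_SHA" |
    git patch-id |
    cut -d' ' -f1
)
# Compute the patch-id of every reachable commit and keep the matching ones.
MATCHING_COMMIT_SHAS=$(
for c in $(git rev-list --all); do
    git show --patch-with-raw "$c" |
        git patch-id
done |
    grep -F "$TARGET_COMMIT_PATCHID" |
    cut -d' ' -f2
)

echo "$MATCHING_COMMIT_SHAS"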

Usage:

$ git list-dupe-commits 7a3e67c
5028e2b5500bd5f4637531e337e17b73f5d0c0b1
7a3e67ce38dbef471889d9f706b9161da7dc5cf3
929c66b5783a0127a7689020d70d398f095b9e00

It isn't terribly speedy, but for most repos should get the job done (just measured 36 seconds for a repo with 826 commits and a 158MB .git dir on a 2.4GHz Core 2 Duo).

Upvotes: 12

bsb

Reputation: 1937

I have a draft that works on a toy repo, but as it keeps the patch->commit map in memory it might have problems on large repos:

# print commit pairs with the same patch-id
for c in $(git rev-list HEAD); do \
    git show $c | git patch-id; done \
| perl -anle '($p,$c)=@F;print "$c $s{$p}" if $s{$p};$s{$p}=$c'

The output should be pairs of commits with the same patch-id (3 duplicates A B C come out as "A B" then "B C").

Change the git rev-list command to restrict the commits checked:

git log --format=%H HEAD somefile

Append "| xargs git show" to view the commits in detail, or "| xargs git show -s --oneline" for a summary:

0569473 add 6-8
5e56314 add 6-8 again
bece3c3 comment
e037ed6 add comment again

It turns out patch-id didn't work in my original case as there were additional changes in that later commit. "git log -S" was more useful.

Upvotes: 4

robinst

Reputation: 31417

Because the duplicate changes are likely not on the same branch (except when there are reverts in between them), you could use git cherry:

git cherry [-v] [<upstream> [<head> [<limit>]]]

Where upstream would be the branch to check for duplicates of changes in head.
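git cherry marks each commit in <head> with a prefix: "-" if an equivalent change is already in <upstream>, "+" if not. Here is a rough sketch of splitting that output. The helper names and the branch names main and topic are placeholders, not anything prescribed by Git:

```shell
# Read git cherry output on stdin and print the hashes of commits whose
# change already exists upstream (lines marked "-").
already_upstream() {
    awk '$1 == "-" { print $2 }'
}

# Read git cherry output on stdin and print the hashes of commits that
# have no equivalent upstream (lines marked "+").
missing_upstream() {
    awk '$1 == "+" { print $2 }'
}

# Hypothetical usage inside a repository:
#   git cherry -v main topic | already_upstream
#   git cherry -v main topic | missing_upstream
```

Both helpers also work with -v, since they only look at the first two fields and ignore the commit subject.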

Upvotes: 13
