What is a simple way to check if two git repositories are unrelated? For example let's assume we cloned following repositories: https://github.com/spring-petclinic/spring-framework-petclinic https://github.com/spring-projects/spring-petclinic How can I check that one doesn't share history with another? How can I check that one share partial history with another? How can I browse common DAG and view difference? Note: Git allows shallow copy + repositories history can diverge with a time...

git

gavenkoa

Reputation: 48753

What is a simple way to check if two git repositories are unrelated?

For example let's assume we cloned following repositories:

How can I check that one doesn't share history with another?

How can I check that one share partial history with another? How can I browse common DAG and view difference?

Note: Git allows shallow copy + repositories history can diverge with a time...

Upvotes: 0

Answers (3)

zigarn

Reputation: 11595

You can clone the first one, then add the second one as a additional remote:

git clone https://github.com/spring-petclinic/spring-framework-petclinic
cd spring-framework-petclinic
git remote add other https://github.com/spring-projects/spring-petclinic
git fetch --all

Then you can browse both DAGs:

git log --graph --all --oneline --decorate

And see if they have any common history by looking at the merge base of two trunks:

git merge-base origin/master other/master

Upvotes: 3

gavenkoa

Reputation: 48753

To find any common revisions:

comm -12 <(cd repo1; git rev-list --all | sort) <(cd repo2; git rev-list --all | sort)

In my case both repositories share commits:

bash# (cd spring-petclinic; git rev-list --all) | wc -l
670

bash# (cd spring-framework-petclinic; git rev-list --all) | wc -l
571

bash# comm -12 <(cd spring-framework-petclinic/; git rev-list --all | sort) <(cd spring-petclinic; git rev-list --all | sort) | wc -l
427

Upvotes: 0

torek

Reputation: 488203

It's easy to prove that two repositories are related: if they contain matching commits—commits with the same hash IDs and contents, although "same hash IDs" is usually sufficient¹—they are related.

As you note, it's much harder to prove that they are un-related unless they are both complete (non-shallow, non-single-branch) clones. If both are complete, yet neither has any commit in common with the other, the two repositories are unrelated.

If you have both repositories and verify that both are complete, simply enumerate all the commit hash IDs in both repositories and look for common IDs. If common IDs exist, the two are probably related. To enumerate all the commit hash IDs, run git rev-list --all (redirect output to file or to program that reads both sets of outputs and checks for common hash IDs).

See footnote 1 for eliminating "probably", but the TL;DR is that for now, any two identical IDs means shared history.

¹Given a uniform hash function h(k) whose range is r = |{h(k)|, the probability of an accidental collision of hashes for two distinct keys k₁, k₂ is p = 1/r. The probability of uniqueness is the complement of this is p̄ = 1 - (1/r). Generalizing to n keys and using a two-term Taylor expansion of e^x ≈ 1 + x for x ≪ 1, we get p̄ ≈ e^{(-n(n-1)) / 2r}, as long as r is reasonably large.

Git's hash function is currently SHA1, which has a pretty uniform distribution and has r = 2¹⁶⁰ = 1461501637330902918203684832716283019655932542976. This satisfies our formula, which means we can use the approximation.

Hence, if you sum up the total number of hashes n and plug it in to the formula:

r=1461501637330902918203684832716283019655932542976
1 - exp(-n*(n-1)/(2*r))

(remember, we want p, not p-bar), you get the probability of a collision. To check for an actual collision, of course, you can just compare the underlying actual objects: if the hashes match, compare the objects directly to detect a collision. But it's extremely unlikely in the first place. If we take two repositories that, together, contain ten million commits, we compute:

$ bc -l
r=2^160
n=10*1000*1000
scale=100
1 - e(-n*(n-1)/(2*r))
.0000000000000000000000000000000000342113848680412753525884397196522\
895097282878872708411144841034243

which as you can see is still pretty tiny. It's not until we get to:

n=10*1000*1000*1000*1000*1000*1000*1000

(ten sextillion objects, using the short scale notation)

that we find:

1 - e(-n*(n-1)/(2*r))
.0000342108030863093209851036344159518189002166758764416221121344549\
079733424124497666779807175655625

a noticeable chance of accidental collision, at about 0.0035%. At 100 sextillion objects we are up to a 0.35% chance:

n=100*1000*1000*1000*1000*1000*1000*1000
1 - e(-n*(n-1)/(2*r))
.0034152934013810288444649336362559390942558976421984312395097770719\
923433072593638116228277476790795

and by 1 septillion objects we're running some serious risks:

1 - e(-n*(n-1)/(2*r))
.2897326871923714506502211457721853341644126909116947422293621066225\
555385326652788789421475224989232

Fortunately, well before then, we've run out of disk space. :-) Also, the Git guys are thinking of moving to one of the SHA-256 hashes, which will raise r to 2²⁵⁶, which helps out our denominator.

(I'm using bc above, in which ^ is exponentiation and the -l library adds e(x) to compute e^x.)

Upvotes: 1

What is a simple way to check if two git repositories are unrelated?

Answers (3)

Related Questions