Reputation: 670

Hash collision in commits from all version control systems

I read "Hash collision in git", and from it, it appears very unlikely that two different commits in Git will have the same hash.

But what about commits in general, not only in Git? My application works with git, svn, and hg. Can I assume that there will be no two different commits with the same hash?

For now I'm trying to decide how to stop my application from creating the same commits in the db from different forks of one repo. I figure I can make the hash column in the db unique, and if I already have a commit with this hash, just skip it. But I don't know whether there is a big or small chance that I will skip a unique commit rather than a duplicate of an already existing commit.

Upvotes: 0

Answers (2)

torek

Reputation: 487705

TL;DR: you are safe unless you mix VCSes.

The problem statement in your question is not quite right in the first place:

... it appears very unlikely that two different commits in Git will have the same hash.

and this (indirectly) leads to a faulty further assumption:

But what about commits in general, not only in Git? My application works with git, svn, and hg. Can I assume that there will be no two different commits with the same hash?

Even if all VCSes were perfect, you could not really make this assumption. Even if all VCSes were both perfect and used the same hash algorithm, you still could not make this assumption. But for your particular problem, there is a much simpler (albeit imperfect) answer.

For now I'm trying to decide how to stop my application from creating the same commits in the db from different forks of one repo ...

The main thing to consider here is the concept of "forks of one repo", and how you are going to identify a particular commit.

If we look at the identity of a commit in Git or Mercurial, we find that it is a hash ID.

Two objects in Git that have the same object-ID are the same object, by definition, because Git will store any object only once. This is because the underlying storage model for Git is a simple key-value store, with the key being a hash ID. There is only one value stored under any single key.

To allow for the four object types in Git (commit, annotated tag, tree, and blob), Git stores the object's type in a header prepended to all objects. It makes the assumption that prepending the string commit <size>\0 to some data results in a different hash than prepending the string blob <size>\0 to the same data. This assumption is largely true, though the pigeonhole principle tells us that there must be some data for which it is false. (To the extent that SHA-1 is good, the chance that any given pair of distinct inputs collides is 1 in 2¹⁶⁰. The work of Stevens et al. shows that SHA-1 is not that good.)
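As a minimal sketch of the header scheme described above, the following computes an object ID the way Git does for repositories that use the SHA-1 object format. The function name git_object_id is made up for illustration; only hashlib is a real library here.

```python
import hashlib


def git_object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 object ID Git would assign to `data` stored as `obj_type`.

    Git hashes "<type> <size>\\0" followed by the raw content, so the same
    bytes stored as a blob and as a commit hash to different IDs.
    """
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    payload = b"same bytes, different type\n"
    print(git_object_id("blob", payload))
    print(git_object_id("commit", payload))
```

For a blob, the result should match what git hash-object prints for a file with the same content.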

In any case, though, Git's underlying storage model means that once a key has an associated value, that key/value pair is now occupied, and no pair with the same key can be stored again. Hence, if some existing key k exists, and has type commit and represents some commit, no new object of any type with key k can be added to the repository database (at least not without first removing the existing object with key k).

What this means is that if you make the assumption that commits are not removed, and if you have seen that key k exists before in any clone of this repository, any other clone with key k has the same object. The hash, in other words, is the object, in a very real sense.

This is not necessarily the case in Mercurial. Mercurial's database can store new commits that have duplicate keys (the simple local sequence number associated with each object can disambiguate them). However, such commits can never be transferred from one repository to another (and are likely to cause other problems), so you are certainly allowed to assume the problem away if the repository will be distributed.

Currently, both Git and Mercurial use SHA-1, but they use it in different ways. That is, the input message on which the hash is computed differs in Git vs Mercurial. What this means is that if you have "forks" G (stored via Git) and M (stored via Mercurial) that represent the same repository, the keys k_G in G are (numerically) unrelated to the keys k_M in M.

Therefore, if you allow two different forks to use two different underlying VCSes, you cannot make the assumption that two different keys represent two different objects, nor that two identical keys represent the same object. If you constrain them to the same VCS, though, you may make this assumption.
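For the database side of the question, one way to apply this is to make the uniqueness constraint cover the VCS as well as the hash, so that an identical key is only treated as "the same commit" within one VCS. This is a sketch under assumptions: the table name commits, its columns, and the helper names open_store and add_commit are all invented for illustration, and SQLite stands in for whatever database the application really uses.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical commit store keyed on (vcs, commit_id)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commits ("
        " vcs TEXT NOT NULL,"        # e.g. 'git' or 'hg'
        " commit_id TEXT NOT NULL,"  # the hash ID reported by that VCS
        " message TEXT,"
        " UNIQUE (vcs, commit_id))"
    )
    return conn


def add_commit(conn: sqlite3.Connection, vcs: str, commit_id: str, message: str) -> bool:
    """Insert a commit; return False if this (vcs, commit_id) was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO commits (vcs, commit_id, message) VALUES (?, ?, ?)",
        (vcs, commit_id, message),
    )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_store()
    print(add_commit(conn, "git", "abc123", "first fork"))   # new row
    print(add_commit(conn, "git", "abc123", "second fork"))  # duplicate, skipped
    print(add_commit(conn, "hg", "abc123", "other VCS"))     # same key, different VCS
```

SVN revision numbers are only unique within a single repository, so for SVN an extra column identifying the repository would also have to be part of the key.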

(SVN does not identify commits by hashes at all. Since SVN repositories are centralized, they can and do use a simple unique integer to represent each commit. By converting the SVN repository to a Git repository, however, you impose the Git restrictions: you now have a repository that meets any restrictions imposed by both VCSes. Should someone add a new commit to the SVN repository that cannot be represented correctly in the Git repository, it simply never goes into the Git repository at all.)

Upvotes: 3

Francesco

Reputation: 4250

Both Git and Mercurial use SHA-1 to generate hashes, so I would say that the probability of two different commits, one from Git and one from Mercurial, having the same hash is the same as the probability of two different Git commits having the same hash.

SVN does not use hashes to identify commits but incremental revision numbers, so you do not have any collision problem there.
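To put a rough number on that probability, here is a sketch of the usual birthday-bound estimate. It assumes SHA-1 behaves like an ideal random 160-bit function, which is an idealization (known attacks on SHA-1 concern deliberately crafted inputs, not accidental collisions). The function name and the commit count are made up for illustration.

```python
def collision_probability(n_commits: int, hash_bits: int = 160) -> float:
    """Approximate chance that any two of `n_commits` distinct inputs share a hash.

    Uses the birthday-bound approximation pairs / 2**bits, which is only
    meaningful while the result is much smaller than 1.
    """
    pairs = n_commits * (n_commits - 1) // 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    # One billion commits pooled from every repository the application sees.
    print(collision_probability(10 ** 9))
```

Even with a billion commits in one table, the estimate comes out far below any practical concern.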

Upvotes: 4
