Reputation: 992
I want to expand on the following question and answers found here: What are the chances that two messages have the same MD5 digest and the same SHA1 digest?
My interest in this question is not cryptographic security, but ensuring the uniqueness of keys used as unique identifiers. The keys must be deterministic rather than randomly generated, so that the same record always produces exactly the same key value.
I'm currently using SHA256 for this, but am concerned that, while improbable, I could still end up with a collision through bad luck. For estimating the risk, I assume I need 1 billion unique IDs. For SHA256, the collision probability works out to approximately 4.32E-60, following the estimation suggested here, using P = k^2/(2N), where k is the number of keys and N is the number of possible hash values. I do not want to move to a larger hash, because I don't want to make my keys significantly larger than they already are; they are already much more costly than an auto-incremented large integer type.
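For reference, here is how I computed that figure (a minimal sketch; the 1 billion IDs and the P = k^2/(2N) birthday-bound approximation are my assumptions from above):

```python
# Birthday-bound approximation P ≈ k^2 / (2N) for k keys drawn
# from an ideal hash with N = 2**bits possible outputs.
def collision_probability(k: int, bits: int) -> float:
    return k * k / (2 * 2**bits)

p_sha256 = collision_probability(10**9, 256)
print(f"{p_sha256:.3g}")  # prints 4.32e-60
```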
Similar to the linked question that I want to expand on, I was considering concatenating two hashes from two orthogonal hash algorithms, SHA1 and MD5, as this would only slightly increase the stored size of my key.
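Concretely, the key I have in mind would be built something like this (a sketch; the record bytes are a stand-in for my real serialized record):

```python
import hashlib

record = b"serialized record contents"  # placeholder for the actual record

# 20-byte SHA1 digest + 16-byte MD5 digest = 36 bytes,
# versus 32 bytes for a single SHA256 digest.
key = hashlib.sha1(record).digest() + hashlib.md5(record).digest()
print(len(key), len(hashlib.sha256(record).digest()))  # prints: 36 32
```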
At least some of the answers to the linked question seem to approach this by multiplying the probabilities that the two algorithms will each experience any collision within the number of keys generated. When I do this, I arrive at something like P = 5.03E-52, which is no better than SHA256.
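That is, the calculation those answers suggest looks like this (a sketch, again assuming my k = 1 billion keys):

```python
# Multiply the independent birthday bounds for SHA1 (160-bit)
# and MD5 (128-bit), each computed over the same k = 1e9 keys.
k = 10**9
p_sha1 = k * k / (2 * 2**160)
p_md5 = k * k / (2 * 2**128)
print(f"{p_sha1 * p_md5:.3g}")  # prints 5.03e-52
```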
However, they do not seem to take into account that the question is about the probability not just that each algorithm experiences some collision, but that both algorithms collide on the same pair of inputs: that is, two inputs x and y with both SHA1(x) = SHA1(y) and MD5(x) = MD5(y), which is what it takes for the concatenated keys to be identical. There are caveats and uncertainties raised in the answers and comments, but it's unclear to me whether any of the answers actually addresses the problem from this perspective.
In short, collisions in either algorithm on its own are tolerable; a duplicate key results only if both algorithms collide on the same pair of inputs.
Given the intersection of events that must all occur for a non-unique key to result, simply multiplying the probabilities of any collision happening in each algorithm seems to greatly overstate the probability of generating a duplicate key.
I do not know how to calculate the probability of a duplicate key under this scheme, but it seems to me that it would be infinitesimally small, much lower than with SHA256 alone.
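For instance, if the concatenated pair behaved like a single ideal 288-bit hash (which I am not at all sure it does; that independence is part of what I'm asking about), the same birthday-bound estimate would give:

```python
# Birthday bound for a hypothetical ideal 288-bit hash,
# i.e. 160 (SHA1) + 128 (MD5) bits treated as one output.
k = 10**9
p_combined = k * k / (2 * 2**288)
print(f"{p_combined:.3g}")  # prints 1.01e-69
```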
Am I right here, am I missing something, or have I misinterpreted some of the answers on the linked question? Does one of those answers already adequately answer my question?
Upvotes: 0
Views: 45