Reputation: 45632
I am making an application that stores documents and gives each one a UID based on a SHA1 digest of a few things including the timestamp. The digest has a lot of characters, and I want to allow users to identify the documents by using the first x characters of the full digest. What's a good value for x if the number of documents is maybe around 10K - 100K?
Upvotes: 28
Views: 14497
Reputation: 11728
Here are a few key things to consider:
What is the theoretical risk of a collision. As others have mentioned, this is a generalization of the "birthday problem".
What is the SHA-1 hash chance of a collision, compared to the theoretical one. Ideally, it should be the same, as in the case of a truly random distribution. You can take a look at some practical tests that show pretty good results for cases similar to yours.
What you consider to be an acceptible risk. Also, how bad for you application it would be if a collision did happen. For some a 10% risk might be acceptable. For others a 0.0001% risk might be unacceptable.
Depending on your case, a good strategy might be to use more bits internally, but only display a part of them to the user. Consider what git does - it uses all the 160 bits of a SHA-1 hash internally, but usually only a short prefix of the whole HEX representation is displayed to the user.
With all that in mind, for up to 100K documents, I'd probably use a 64-bit hash. With that, the risk for a collision would be lower than the average's person risk of getting hit by a thunder at least twice in their lifetime. Most DBs and programming languages have a native 64-bit integer type, so that would be convenient, fast and memory efficient.
Then, you need to decide what the best way to display that hash to your users is. If you display it as a HEX, that would be 16 hex-chars for the whole 64-bit hash. I'd consider displaying it in a format that is easier for humans to visually compare values. Example: 72af-88c1-d2a6-5c2e
.
Upvotes: 1
Reputation: 231223
Adapting the formulas on on wikipedia for the Birthday problem, you can approximate the probability of collision as 1 - e^(-n^2/(2^(b+1)))
, where n
is the document count and b
is the number of bits. Graphing this formula with n=100,000, it looks like you'll want b > 45 at least. I'd be more inclined to go with 64 to make it a nice and round number. That said, do have a plan to deal with collisions if they occur (maybe alter the timestamp slightly, or add a nonce?)
For that matter, if the sha1 is based on more than just the content of the document, why not simply make it a random ID? In this case collisions are less of a problem, as you can always generate a new random number and try again (the probability of a collision with a single try is the same, however).
Upvotes: 26
Reputation: 102306
Be careful of truncation as there is no reduction in proof that the smaller hash is secure. See Kelsey's http://csrc.nist.gov/groups/ST/hash/documents/Kelsey_Truncation.pdf. Kelsey gives to heuristic arguments stating the same ("Related Hash Outputs" and "Near Collisions"). Biham/Chen offer examples of Near Collisions; and Knudsen demonstrates Truncated Differentials.
In the end, you probably want to feed your data into an HMAC with the truncated size (the size is digested by the HMAC, too) and then use the truncated HMAC.
Upvotes: 3
Reputation: 134631
It's a generalization of the birthday problem. In you case n is number of documents, and instead of constant 365 you'd have number of possibilities the cutoff gives you (so for k bits it's 2k).
Of course exact calculation is out of the question, but you might use approximation.
Upvotes: 1
Reputation: 5211
Well, here's a possibly too simplistic of an answer..
If with full sha1 you get about 1 in 2^160 chance of collision, then by truncating one character you increase the chances of collision by 16 (all possible values of the truncated character)... which is 2^4.. So, if you truncate x characters you get 1 in 2^(160 - 4*x) chances of collision.. right?
Upvotes: 0
Reputation: 185643
There really isn't a value for this; part of what makes SHA a good general-purpose hashing algorithm is that similar data does not necessarily produce similar hashed values. Your best bet (without knowing anything else about your system) would just be to search the list of documents whose hashes start with the value supplied by the user, then either present them with a list of documents to select from or go directly to the document if there's only one.
Upvotes: 2