Reputation: 231
I want to represent short texts (ie word, couple of words) as a 64 bit hash (want to store them as longs)
MessageDigest.getInstance("MD5") returns 128bits.
Is there anything else I could use, could i just peel off half of it. I am not worried of someone trying to duplicate a hash, I would like to minimize the number of clashes (two different strings having the same hash)
Upvotes: 1
Views: 2035
Reputation: 75466
You can just use any part of the MD5 hash.
We tried to fold 128-bit into 64-bit with various algorithms but the folding action didn't make any noticeable difference in hash distribution.
Why don't you just using hashCode() of String? We hashed 8 million Email addresses into 32-bit integer and there are actually more collisions with MD5 than String hashCode. You can run hashCode twice (forward and backward) and make it a 64-bit long.
Upvotes: 2
Reputation: 9086
MD5 (and SHA) hash "smear" the data in a uniform way across the hashed value so any 64 bits ypu choose out of the final value will be as sensitive to a change as any other 64 bits. Your only concern will be the increased probability of collisions.
Upvotes: 2
Reputation: 346327
As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.
Upvotes: 1
Reputation: 57707
You can take a sampling of 64-bits from the 128-bit hash. You cannot guarantee there will be no clashes - only a perfect hash will give you that, and there is no perfect hash for arbitrary length strings) but the chances of a clash will be very small.
As well as a sampling, you could derive the hash using a more complex function, such as XOR consecutive pairs of bits.
Upvotes: 1