Smalcat
Smalcat

Reputation: 231

64bit MessageDigest - store short texts as long

I want to represent short texts (ie word, couple of words) as a 64 bit hash (want to store them as longs)

MessageDigest.getInstance("MD5") returns 128bits.

Is there anything else I could use, could i just peel off half of it. I am not worried of someone trying to duplicate a hash, I would like to minimize the number of clashes (two different strings having the same hash)

Upvotes: 1

Views: 2035

Answers (5)

ZZ Coder
ZZ Coder

Reputation: 75466

You can just use any part of the MD5 hash.

We tried to fold 128-bit into 64-bit with various algorithms but the folding action didn't make any noticeable difference in hash distribution.

Why don't you just using hashCode() of String? We hashed 8 million Email addresses into 32-bit integer and there are actually more collisions with MD5 than String hashCode. You can run hashCode twice (forward and backward) and make it a 64-bit long.

Upvotes: 2

Rostislav Matl
Rostislav Matl

Reputation: 4543

What about using some block cipher with 64bit block size ?

Upvotes: 0

David Soroko
David Soroko

Reputation: 9086

MD5 (and SHA) hash "smear" the data in a uniform way across the hashed value so any 64 bits ypu choose out of the final value will be as sensitive to a change as any other 64 bits. Your only concern will be the increased probability of collisions.

Upvotes: 2

Michael Borgwardt
Michael Borgwardt

Reputation: 346327

As a cryptographic hash (even one nowadays considered broken), MD5 has no significant correlation between input and output bits. That means, simply taking the first or last half will give you a perfectly well-distributed hash function. Anything else would never have been seriously considered as a cryptographic hash.

Upvotes: 1

mdma
mdma

Reputation: 57707

You can take a sampling of 64-bits from the 128-bit hash. You cannot guarantee there will be no clashes - only a perfect hash will give you that, and there is no perfect hash for arbitrary length strings) but the chances of a clash will be very small.

As well as a sampling, you could derive the hash using a more complex function, such as XOR consecutive pairs of bits.

Upvotes: 1

Related Questions