Subway

Reputation: 5716

C++ OpenSSL: md5-based 64-bit hash

I know the original md5 algorithm produces a 128-bit hash.

Following Mark Adler's comments here, I'm interested in getting a good 64-bit hash. Is there a way to create an md5-based 64-bit hash using OpenSSL? (md5 looks good enough for my needs.) If not, is there another algorithm implemented in the OpenSSL library that can get this job done with quality no less than md5's (except for the length, of course)?

Upvotes: 2

Views: 2178

Answers (1)

Sam

Reputation: 7868

I claim that 'hash quality' is strongly related to the hash length. AFAIK, OpenSSL does not have 64-bit hash algos, so the first idea I had is simple and most probably worthless:

halfMD5 = md5.hiQuadWord ^ md5.lowQuadWord
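
A rough C++ sketch of that idea, assuming OpenSSL's one-shot EVP interface and linking against -lcrypto (the helper name is made up):

    #include <openssl/evp.h>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Fold the 128-bit md5 digest into 64 bits by XOR'ing its high and
    // low quad words, as sketched above.
    uint64_t half_md5(const std::string& data) {
        unsigned char digest[EVP_MAX_MD_SIZE];
        unsigned int len = 0;
        // One-shot md5 via the EVP interface.
        EVP_Digest(data.data(), data.size(), digest, &len, EVP_md5(), nullptr);

        uint64_t hi = 0, lo = 0;
        std::memcpy(&hi, digest, sizeof hi);      // first 8 bytes
        std::memcpy(&lo, digest + 8, sizeof lo);  // last 8 bytes
        return hi ^ lo;
    }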

Finally, I'd simply use an algorithm with appropriate output, like crc64.
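
For illustration, a minimal bit-by-bit CRC-64 could look like the following; this assumes the ECMA-182 parameters (no reflection, zero init, no final XOR) and is unoptimized, so a real implementation should be table-driven and verified against one of the sources below:

    #include <cstdint>
    #include <cstddef>

    // Minimal bit-by-bit CRC-64 (ECMA-182 polynomial). For production use,
    // prefer a verified, table-driven implementation.
    uint64_t crc64_ecma(const unsigned char* data, std::size_t len) {
        const uint64_t poly = 0x42F0E1EBA9EA3693ULL;
        uint64_t crc = 0;
        for (std::size_t i = 0; i < len; ++i) {
            crc ^= static_cast<uint64_t>(data[i]) << 56;
            for (int bit = 0; bit < 8; ++bit)
                crc = (crc & 0x8000000000000000ULL) ? (crc << 1) ^ poly
                                                    : crc << 1;
        }
        return crc;
    }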

Some crc64 sources to verify:


Edit

At first glance, Jenkins looks perfect; however, I'm trying to find a friendly C++ implementation for it without luck so far. BTW, I'm wondering, since this is such a good hash for databases' duplication checking, how come none of the common open-source libraries, like OpenSSL, provide an API for it? – Subway

This might simply be due to the fact that OpenSSL is a crypto library in the first place, built around large hash values with appropriate cryptographic characteristics.

Hash algos for data structures have other primary goals, e.g. good distribution characteristics for hash tables, where small hash values are used as an index into a list of buckets containing zero, one, or multiple (colliding) elements.
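
As a concrete example of such a data-structure-oriented function, Bob Jenkins' classic one-at-a-time hash is small enough to quote directly; note that it is only 32-bit and is shown purely as an illustration, not as the 64-bit hash asked about:

    #include <cstdint>
    #include <cstddef>

    // Jenkins' one-at-a-time hash (32-bit): a typical hash-table hash with
    // good distribution but no cryptographic properties.
    uint32_t one_at_a_time(const unsigned char* key, std::size_t len) {
        uint32_t hash = 0;
        for (std::size_t i = 0; i < len; ++i) {
            hash += key[i];
            hash += hash << 10;
            hash ^= hash >> 6;
        }
        hash += hash << 3;
        hash ^= hash >> 11;
        hash += hash << 15;
        return hash;
    }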

So the point is whether, how, and where collisions are handled. In a typical DBMS, an index on a column will handle them itself.

Corresponding containers (maps or sets) would handle it in the same way.
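
A minimal C++ sketch of that, assuming std::unordered_map keyed by some 64-bit hash (names are hypothetical):

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Stand-in for an indexed column: map a 64-bit hash to the stored
    // contents. Bucket collisions inside the table are resolved by the
    // container; equal keys (real hash collisions) are up to the caller.
    std::unordered_map<uint64_t, std::string> contents_by_hash;

    void remember(uint64_t hash, const std::string& contents) {
        contents_by_hash[hash] = contents;  // last writer wins on a collision
    }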

The unique constraint would additionally prohibit insertion of equal field contents.
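
The in-memory counterpart would be a set-like insert that rejects duplicates; a hedged sketch (hypothetical names):

    #include <string>
    #include <unordered_set>

    // Counterpart of a unique constraint on the plaintext column: insert()
    // returns {iterator, bool}, and the bool is false when an equal element
    // already exists, so the duplicate is rejected.
    std::unordered_set<std::string> unique_contents;

    bool insert_unique(const std::string& contents) {
        return unique_contents.insert(contents).second;
    }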


For example, suppose we have a table with file contents (plaintext, non-crypto application) and a checksum or hash value for mapping or consistency checks. We want to insert a new file. For that, we precompute the hash value or checksum and query for existing files with an equal hash value or checksum, respectively. If none exists, there won't be a collision and insertion is safe. If there are one or more existing records, there is a high probability of an exact match and a lower probability of a 'real' collision.

  • In case collision handling should be omitted, one could add a unique constraint to the hash column and reuse existing records, accepting the possibility of mismatching/colliding contents. Here, you'd want a database-friendly hash algo like 'Jenkins'.

  • In case collisions need to be handled, one could add a unique constraint to the plaintext column. Less database-friendly checksum algos like CRC won't have an influence on collisions among records and can be chosen according to the types of corruption to be detected or other requirements. It is even possible to use the XOR'ed quad words of an md5 as mentioned at the beginning (see the sketch below).
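
Putting the second variant together, a hedged sketch of the lookup-then-compare flow described above, with an in-memory multimap standing in for the table and hypothetical names throughout:

    #include <cstdint>
    #include <string>
    #include <unordered_map>

    // Stand-in for the files table: 64-bit hash/checksum -> contents.
    // A multimap allows several records with the same (colliding) hash.
    std::unordered_multimap<uint64_t, std::string> files;

    // Insert a file unless an exact duplicate already exists; returns true
    // if a new record was added. hash64 is any 64-bit hash or checksum,
    // e.g. the half-md5 sketch above.
    bool insert_file(const std::string& contents,
                     uint64_t (*hash64)(const std::string&)) {
        const uint64_t h = hash64(contents);
        auto range = files.equal_range(h);   // candidates with equal hash
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == contents)      // exact match: reuse the record
                return false;
            // otherwise it's a 'real' collision; keep scanning, insert anyway
        }
        files.emplace(h, contents);
        return true;
    }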

Some other thoughts:

  • If an index/constraint on plaintext columns does the mapping, any hash value can be used to do reasonably fast lookups to find potential matches.
  • No one will stop you from adding both a mapping-friendly hash and a checksum.
  • Unique constraints will also add an index; indexes are basically like the hash tables mentioned above.

In short, it greatly depends on what exactly you want to achieve with a 64-bit hash algo.

Upvotes: 2
