Reputation: 14246
I'm looking to do a custom hash table implementation in C. Is there an MD5/SHA1 hash function already in the GNU library or do I have to use an external library for this?
Here's kinda what I'm looking for:
int hashValue;
hashValue = MD5_HASH(valToHash);
Upvotes: 10
Views: 17696
Reputation: 132869
CityHash64 is probably always a good choice in general for modern x86 systems, even though it's not always your fastest option. With small amounts of data (~16 bytes), FarmHash64 may be your fastest option, with medium size amounts of data (~128 bytes), CityHash64 may be the fastest one, but the larger the data becomes (~4KB), the more everything shifts in favor of xxHash64 that will eventually dominate the other two. Most hashtable keys are rather small but if you know that your keys will be always big, you may also consider xxHash64 instead.
If you are also targeting non-x86 systems (ARM, RISC-V, PowerPC) or older x86 CPUs (especially ones from AMD), you should definitely prefer CityHash64, as all of the three mentioned above will perform worse under these conditions but CityHash64 proves to be more stable than the other two. Sometimes FarmHash64 will be the fastest one non-x86 CPUs but even when it beats CityHash64, it was a relatively close call, so CityHash64 is still a good choice.
When targeting 32 bit systems, use xxHash32 instead, that is if 32 bit hash values are sufficient, as for 32 bit, xxHash32 always produces good results and quickly dominates all others when the hashed data gets larger, already starting at medium size range.
Here's a list of good candidates if you require a hash function for a hashtable implementation:
clhash: This hash is ridiculously fast. Up to 60% faster than CityHash if the hashed value is below 64 bytes, otherwise equally fast. Yet it requires a special x86 instruction that only newer x86 CPUs support. So it's not portable to old CPUs and it's not portable to non x86 systems (e.g. ARM or RISC-V or PowerPC).
https://github.com/lemire/clhash
xxHash: Very fast and portable hash algorithm, that can make use of various CPU features and is sometimes faster than copying raw bytes with memcpy()
(which will only work if the data to hash is currently in CPU cache, of course). Unlikely most other hashes, which use bit shifting an logic instructions a lot, this hash is heavily based on multiplying numbers, which modern CPUs can do pretty fast but with older ones CityHash may deliver better performance.
https://xxhash.com
CityHash: Fast and portable algorithm. It will use some SSE, to be even faster on modern x86 CPUs, but it provides good performance on all CPUs. Note that CityHash is written in C++, but easily portable to plain C.
https://github.com/google/cityhash
MurmurHash3: Good hash, that is optimized to different x86 CPUs and assumes little endian integers. It can be compiled for non-x86 CPUs without problems, but it will suffer in speed and if those CPUs are big endian, you need to implement bit swapping in two places (the source code points those out, further reducing performance). If you don't bitswap, you break the mixing and won't get any good hash values.
https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
FarmHash: Is the successor to CityHash and basically a mix of CityHash and several concepts taken from the Murmur family. It may perform better than CityHash or it may not. Google says it performs better in their data centers but your mileage may vary.
https://github.com/google/farmhash
lookup3: Very portable hash function that only builds on standard C and that I would consider the baseline. The original MurmurHash actually was created as a faster lookup3 replacement. What do I mean by baseline? If your chosen hash algorithm cannot beat lookup3 in speed or create equally good hash values, drop it immediately. No hash that cannot even beat lookup3 is ever worth considering. Why would you even waste your time with it? Just copy the public domain lookup3 implementation into your project and move on.
https://www.burtleburtle.net/bob/c/lookup3.c
SpookyHash: The author of lookup3 also wrote another hash that is portable, can be faster than CityHash on machines without SSE but the reference implementation is optimized for x86 and little endian, pretty much like MurMur3 and in C++; yet it should be no problem to port that code to good old C.
https://www.burtleburtle.net/bob/hash/spooky.html
xxHash has a benchmark comparison right on the front page, but it misses clhash and it only contains one of the three MurMur3 variants (only the 32 bit one). Also it claims SpookHash is only 64 bit while in fact it is 128 bit (when producing 32/64 bit hashes, it just shortens the 128 bit value accordingly). I don't really trust that table, I think it has been tinkered to make xxHash look even more superior.
At the end of the day, there is no definitive answer what's the best hash to use. E.g. on a Core i7-5820K system and using a 64 bit build, xxHash64 wins at larger (0.5KB+) data sizes, followed closely by 64 bit FarmHash and CityHash; on smaller data all three are always very close and sometimes one wins, sometimes the other one. On a Phone SE (A9 CPU) and using a 64 bit process, xxHash never wins and CityHash and FarmHash dominate. Yet on the same system in a 32 bit process xxHash32 beats the crap out of the others. On an Xbox One (AMD Jaguar 1.75GHz CPU), despite being x86 with SSE, CityHash and FarmHash dominate over xxHash.
The problem is that most of the modern hash functions were written with certain CPUs in mind, make assumption about architectures, or were optimized for the needs of a certain data center operator (like Google or Facebook) and will outperform all the others when used as designed but when used in other scenarios or on other platforms, the results may be completely different. They all have excellent hash quality and produces very little collisions in practice, so that's not what sets them apart.
Check out this blog entry from 2016 for detailed results:
https://aras-p.info/blog/2016/08/09/More-Hash-Function-Tests
But one thing is for sure: They all can easily beat the crap out of MD5, which is 100 times slower than any of the ones mentioned above will be on your system, as MD5 was designed to be cryptographic safe and that requires way harder mixing than is required to just avoid collisions. Sure, MD5 also wanted to avoid collisions (two unrelated pieces of data both coincidentally producing the same hash value) but more important than that for a cryptographic hash function is the inability to reverse the hash, meaning, given a specific hash value, it must be impossible to produce data that when getting hashed will result in exactly this value. If you can produce collisions, that's very bad and allows many kind of attacks, that's why MD5 is not safe anymore (there is a known method to produce collisions and that very fast) but as of today, there is no method to produce data that will hash to a given value, as if you can do that, all gates are open to any attack you can even think of.
Upvotes: 0
Reputation: 1244
Murmur3 is a fast noncryptographic algorithm that you can use.
A good speed comparation of murmur against other algorithms can be found in this thread https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
One possible implementation: https://github.com/PeterScott/murmur3
Example:
uint32_t hash;
uint32_t seed = 42;
char* input = "HelloWorld";
MurmurHash3_x86_32(input, strlen(input), seed, &hash);
printf("x86_32: %08x\n", hash);
Upvotes: 1
Reputation: 317
gcrypt and openssl can do MD5, SHA and other hashes here's an example with libgcrypt:
#include <gcrypt.h>
#include <stdio.h>
// compile gcc md5_test.c -lgcrypt
int main(int argc, char *argv[])
{
unsigned char digest[16];
char digest_ascii[32+1] = {0,};
int digest_length = gcry_md_get_algo_dlen (GCRY_MD_MD5);
int i;
printf("hashing=%s len=%d\n", argv[1], digest_length);
gcry_md_hash_buffer(GCRY_MD_MD5, digest, argv[1], strlen(argv[1]));
for (i=0; i < digest_length; i++) {
sprintf(digest_ascii+(i*2), "%02x", digest[i]);
}
printf("hash=%s\n", digest_ascii);
}
`
Upvotes: 3
Reputation: 340168
You can take a look at Bob Jenkin's survey and analysis of many hash functions:
Or just drop his lookup3 routines (which he's put into the public domain) into your project:
Upvotes: 6
Reputation: 74382
For a hash table, you do not need cryptographic strength, only good randomization properties. Broken cryptographic hash functions (like MD5) are fine for that, but you may want to use MD4, which is both faster and simpler, to the point that you could simply include an implementation directly in your code. It is not difficult to rewrite it from the specification (and since you want only a function for a hash table, it is not really a problem if you get it wrong at some point). Shameless plug: there is an optimized C implementation of MD4 in sphlib.
Upvotes: 6
Reputation: 1333
The OpenSSL library has all the crypto routines you could ever want, including cryptographic hashes.
Upvotes: 2
Reputation: 43688
Glibc's crypt()
uses a MD5 based algorhytm if salt starts with $1$. But since you mention that you are going to do a hash table implementation, maybe Jenkins hash would be more appropiate.
Upvotes: 3
Reputation: 368181
There are a few trusted, simple versions available -- I have a few in the sources of the digest for R. Here is what I wrote in the DESCRIPTION file:
Description: The digest package provides functions for the creation of `hash' digests of arbitrary R objects using the md5, sha-1, sha-256 and crc32 algorithms permitting easy comparison of R language objects. The md5 algorithm by Ron Rivest is specified in RFC 1321, the SHA-1 and SHA-256 algorithms are specified in FIPS-180-1 and FIPS-180-2, and the crc32 algorithm is described in
ftp://ftp.rocksoft.com/cliens/rocksoft/papers/crc_v3.txt. For md5, sha-1 and sha-256, this packages uses small standalone implementations that were provided by Christophe Devine. For crc32, code from the zlib library is used.
I think some of Christophe's code is no longer at cr0.net, but searches should lead you to several other projects incorporating it. His file headers were pretty clear:
/*
* FIPS-180-1 compliant SHA-1 implementation,
* by Christophe Devine <[email protected]>;
* this program is licensed under the GPL.
*/
and his code matches the reference output.
Upvotes: 3
Reputation: 71939
Unless you already have a good reason for using MD5, you may want to reconsider. What makes for a "good" hash function in a hash table is pretty dependent on what you're trying to accomplish. You may want to read the comments in Python's dictobject.c
to see the sorts of tradeoffs others have made.
Upvotes: 3