Reputation: 757
So I don't need a cryptographic hash function, but I do need to be sure that the likelihood of a collision is astronomically low. But I don't have an advisory, I'm generating all the objects that will be hashed. My objects are very small but contain hashes inside of them and are thus "high entropy" in some sense. But the objects will range in size from 8 to 36 bytes (assuming 128-bit hashes) and are thus very very small. The smaller objects can just be hashed as themselves but the larger objects (in the 20-36 byte range) need to be hashed down to 128-bits.
The objects essentially form a directed graph of objects with a small amount of metadata in each one, and the hashes act as pointers to previously created objects. I want to be able to generate as many nodes in this graph per second as possible but at the end of the day it is required that there be a unique, ideally there would be at least 128-bits but I could be argued into 64-bits. More than 128-bits is not needed and I'll likely just xor reduce any larger hashes down to 128-bits.
Currently I'm using smhasher to explore the zoo of possibilities: siphash, xxhash128, metrohash128, farmhash128, cityhash128, cityhash128_crc, cityhash64, Pearson block hash 128, prvhash64_128, spooky hash 128.
I'll probably do a test run where I simulate a very dumbed down version so that I can just see how likely a collision is in practice. I also might not find libraries for each of these that I like. The project is in rust so having a good library that already "does the thing" in rust is nice. Additionally being able to implement myself easily (such as with fnv128) is a huge benefit.
Does anyone have any hash functions or things I should read about?
Upvotes: 1
Views: 444
Reputation: 76409
If you want a hash where it is infeasible to find collisions, then you need a collision-resistant cryptographic hash function. None of the options you've listed fit into that category.
SipHash is a PRF and must be used with a secret key in order to be secure. It is not a general hash function in the sense of keyless options.
You also need to be careful that even with a cryptographic hash function, with a very short output like you're suggesting, the risk of a collision is still significant due to the birthday paradox. If you use a 64-bit hash, the likelihood of a collision with 3 trillion nodes is very high. It is much less with a 128-bit hash, but we typically still consider that too high for cryptographic purposes, although you may judge it acceptable.
The cryptographic options you have, assuming performance is a major constraint, are BLAKE2s, BLAKE2b, BLAKE3, and SHA-256. SHA-256 is accelerated in hardware in many recent Intel and AMD chips (although the extensions are new enough that there's still hardware in use that doesn't support them), and, on my Core i7-1280P, can process 124454 kB/s on 16-byte chunks. You can truncate it to 128 bits if need be.
BLAKE2s and BLAKE2b are also very fast and can be explicitly truncated to 128-bit outputs. BLAKE2b will be faster on 64-bit machines, and BLAKE2s will be faster on 32-bit machines, and where hardware acceleration is not possible, they will be faster than SHA-256. BLAKE3 is also a possibility and is even faster than BLAKE2. It's also parallelizable on large data (making it very fast), which isn't a concern for you, and it can also be truncated to an arbitrary size.
There are Rust crates for all of these algorithms. You can certainly implement them yourself, but using the Rust crate will be less error-prone and probably a bunch faster. I have implemented most of them by hand, and I would also recommend against doing so (and just use the crates).
Upvotes: 0