Reputation: 215
I am trying to find something potentially faster than SHA256. I have over 1 billion records I need to hash and verify are unique. I am currently running each record through MD5, which seems pretty fast, and then through SHA256 to avoid collisions. Running them in that order gives me a small performance boost, but I still need it to be faster. I am looking for the names of some hash functions, or examples in C# or pseudo-code, so I can recreate them in C#.
Upvotes: 4
Views: 8973
Reputation: 3671
From the way you phrased the question, it does not appear you need a security-grade hash algorithm. If what you described is the whole of the requirement, you may not need a hash algorithm at all.
If you are constructing a method named Unique that returns a Boolean true if and only if the two rows are distinct, you can gain speed and retain reliability by comparing the following three row characteristics, in this order:
- the record length
- a checksum stored with the record
- the actual byte values
The first is probably already known if the record length is variable. The second can be quickly calculated at the time of storage. With a billion records you will have to cover the possibility of collisions even if you use security-grade hash algorithms (which you stated are too slow anyway). So when the checksums match, which will be rare for distinct rows if the checksum has a sufficient number of bits, you still have to cover that case by comparing the actual values byte by byte.
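A minimal sketch of that three-stage check, assuming each record is available as a byte[] with its checksum stored alongside it (the Record type and its fields here are hypothetical):

```csharp
using System;

// Hypothetical record wrapper: Data is the raw bytes, Checksum is
// computed once when the record is stored.
public sealed class Record
{
    public byte[] Data;
    public uint Checksum;
}

public static class RowComparer
{
    // Returns true if and only if the two rows differ.
    public static bool AreUnique(Record a, Record b)
    {
        // 1) Cheapest test: different lengths means the rows differ.
        if (a.Data.Length != b.Data.Length)
            return true;

        // 2) Next cheapest: the checksum computed at storage time.
        if (a.Checksum != b.Checksum)
            return true;

        // 3) Checksums match (rare for distinct rows), so fall back to the
        //    byte-by-byte comparison; it returns at the first difference.
        for (int i = 0; i < a.Data.Length; i++)
        {
            if (a.Data[i] != b.Data[i])
                return true;
        }
        return false;   // identical in length, checksum and content
    }
}
```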
Upvotes: 0
Reputation: 10989
There is a lot of dubious information in the answers here. You tagged your question with cryptography
and only mention cryptographic hash functions, but it sounds like you don't really need cryptographic security, in particular because you say:
I have over 1 billion records I need to hash and verify if they are unique.
There are four properties to a cryptographic hash function:
- it is easy to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash.
You're really only interested in the first property, and uniqueness is a smaller-scale requirement that is only partially related to the other three properties of cryptographic security.
There is overhead in cryptographic security. You don't need it, and you're interested in speed, so why not skip it? The hash widths of MD5 and the SHA family are admittedly large enough for your purposes.
Check out the list of hash functions on Wikipedia, or check out the article on normal hash functions. More to the point, what's wrong with the built-in .NET hashing functions? Have you tried just deferring to the Object.GetHashCode() method? That MSDN reference has a lot to say about using hash functions. You don't say much about the data you're hashing, so it's hard to say whether the output would be unique between your objects or not. How are you feeding the object into the MD5 hasher? I presume you're taking the binary representation of it. A similar approach could be used with the built-in non-crypto hash function, as in the sketch below.
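As a rough illustration of that idea, assuming each record can be rendered as a string (the exact representation is up to you): string.GetHashCode() is content-based and cheap, so it can serve as a 32-bit first-pass filter that only flags candidate duplicates for a real comparison. Note that calling Object.GetHashCode() directly on a byte[] would not help, since for arrays it is identity-based, and on .NET Core string hash codes are only stable within a single process run.

```csharp
using System.Collections.Generic;

public static class FirstPassFilter
{
    // Yields records whose 32-bit hash code was already seen; only these
    // candidates need an exact comparison afterwards.
    public static IEnumerable<string> FindCandidateDuplicates(IEnumerable<string> records)
    {
        var seen = new HashSet<int>();
        foreach (var record in records)
        {
            // Cheap, content-based, non-cryptographic 32-bit hash.
            int hash = record.GetHashCode();
            if (!seen.Add(hash))
                yield return record;   // possible duplicate: verify with a full comparison
        }
    }
}
```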
You may be concerned about the uniqueness of the built-in hash functions. They only return a regular int, which gives 2^32 possible values, only about four times the size of the data set you're working with. So you always need a backup plan: with a 32-bit hash over a billion records, collisions are not just possible but effectively guaranteed by the birthday bound. The standard fallback is to perform a more expensive comparison when hashes match, usually a reference comparison and a field-wise value comparison.
If you aren't prepared to do an exact comparison on your hash outputs, you're basically counting down until you get a false positive. That might not be a big deal for you: only you can judge what the downside there is.
Additionally, performing another hash function computation is probably not much faster than the direct comparison. You're better off on all counts to go with the sure thing and perform the lengthy, direct comparison.
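One way to get exactly that behaviour in .NET is a HashSet<byte[]> with a custom IEqualityComparer<byte[]>: GetHashCode is the cheap first pass, and Equals only runs the lengthy direct comparison when two hash codes match. This is only a sketch; the FNV-1a-style hash used here is an arbitrary choice of cheap non-crypto hash.

```csharp
using System.Collections.Generic;
using System.Linq;

// Content-based comparer for raw records stored as byte arrays.
public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public int GetHashCode(byte[] data)
    {
        // Cheap, non-cryptographic 32-bit hash (FNV-1a constants).
        unchecked
        {
            int hash = (int)2166136261;
            foreach (byte b in data)
                hash = (hash ^ b) * 16777619;
            return hash;
        }
    }

    public bool Equals(byte[] x, byte[] y)
    {
        // Only reached when the hash codes match: do the exact comparison.
        return x.Length == y.Length && x.SequenceEqual(y);
    }
}

// Usage: HashSet<byte[]>.Add returns false for a record already present.
// var unique = new HashSet<byte[]>(new ByteArrayComparer());
// bool isNew = unique.Add(recordBytes);
```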
Another common anti-collision technique is to use multiple keys. If your data points have several large subcomponents, you hash and compare them independently. If they have some large and some small components (say, some simple numeric types), you hash the large ones and do a direct comparison on the small ones. If they have some data that's easy to take the ordinal of (say, the lengths of strings or the sizes of some containers), you can perform the direct comparison on those bits, as in the sketch below.
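For example, for a hypothetical record with one large payload and a couple of small scalar fields, a mixed comparison might look like the following (the MixedRecord shape and its precomputed PayloadHash are assumptions):

```csharp
// Hypothetical record: two small scalar fields and one large payload.
public sealed class MixedRecord
{
    public int Id;
    public short Flags;
    public byte[] Payload;      // the large component
    public uint PayloadHash;    // hash of Payload, precomputed at storage time
}

public static class MixedComparer
{
    public static bool ProbablyEqual(MixedRecord a, MixedRecord b)
    {
        // Small components: a direct comparison is cheaper than any hash.
        if (a.Id != b.Id || a.Flags != b.Flags)
            return false;

        // Easy ordinal of the large component: its length.
        if (a.Payload.Length != b.Payload.Length)
            return false;

        // Large component: compare precomputed hashes; a match still needs
        // a byte-wise check if false positives are unacceptable.
        return a.PayloadHash == b.PayloadHash;
    }
}
```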
If that doesn't work out for you, take a look at implementations of the other hash functions listed on the wiki. Here's a pretty good reference for MurmurHash3, which can compute 32-bit or 128-bit hash values. There are other hash functions in the list that have long hash widths as well and also have C# libraries available. But as that reference points out, MurmurHash3 is way faster than the MD5 and SHA functions, although it doesn't make a direct comparison to the Object.GetHashCode method I mentioned above.
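If you would rather not take a library dependency, the x86 32-bit variant of MurmurHash3 is small enough to port directly. The sketch below follows the published algorithm (and assumes a little-endian platform, like the reference implementation); it is not taken from any particular C# package:

```csharp
using System;

public static class Murmur3
{
    // MurmurHash3, x86 32-bit variant.
    public static uint Hash32(byte[] data, uint seed = 0)
    {
        const uint c1 = 0xcc9e2d51;
        const uint c2 = 0x1b873593;
        uint h = seed;
        int nblocks = data.Length / 4;

        // Body: mix four bytes at a time.
        for (int i = 0; i < nblocks; i++)
        {
            uint k = BitConverter.ToUInt32(data, i * 4);
            k *= c1;
            k = RotateLeft(k, 15);
            k *= c2;

            h ^= k;
            h = RotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1-3 bytes.
        uint t = 0;
        int tail = nblocks * 4;
        int rem = data.Length & 3;
        if (rem == 3) t ^= (uint)data[tail + 2] << 16;
        if (rem >= 2) t ^= (uint)data[tail + 1] << 8;
        if (rem >= 1)
        {
            t ^= data[tail];
            t *= c1;
            t = RotateLeft(t, 15);
            t *= c2;
            h ^= t;
        }

        // Finalization: force the last few bits to avalanche.
        h ^= (uint)data.Length;
        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }

    private static uint RotateLeft(uint x, int r) => (x << r) | (x >> (32 - r));
}
```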
Upvotes: 6
Reputation: 171
You could even do something like take MD5 and, if you get a collision, add a little extra data (the same data) to both values and take MD5 again. It is highly unlikely for the two to collide again if they were different. So rather than running SHA after the collision, run MD5 again with something appended, which should be faster.
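A rough sketch of that idea with the built-in System.Security.Cryptography.MD5; the extra bytes appended to both values are just a placeholder:

```csharp
using System.Linq;
using System.Security.Cryptography;

public static class RehashCheck
{
    // Arbitrary extra bytes appended to both values before re-hashing.
    private static readonly byte[] Salt = { 0xDE, 0xAD, 0xBE, 0xEF };

    // Called only when the first MD5 hashes of a and b already collided.
    public static bool StillCollides(byte[] a, byte[] b)
    {
        using (var md5 = MD5.Create())
        {
            byte[] ha = md5.ComputeHash(a.Concat(Salt).ToArray());
            byte[] hb = md5.ComputeHash(b.Concat(Salt).ToArray());
            return ha.SequenceEqual(hb);
        }
    }
}
```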
Upvotes: 0
Reputation: 416111
Are you checking every record with SHA256? You should only need to check the records where you have MD5 collisions, which should be rare even with MD5. And at that point, when you're just comparing suspected duplicates, it might be faster to compare raw record to raw record, because the compare will return on the first difference.
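A sketch of that flow: group records by their MD5 digest first, then only the groups that actually collide need the raw byte-by-byte comparison. The in-memory grouping here is an assumption; with a billion records you would likely key an index or partitioned store instead:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

public static class Md5Grouping
{
    // Maps each MD5 digest (as a hex string key) to the records that produced it.
    public static Dictionary<string, List<byte[]>> GroupByMd5(IEnumerable<byte[]> records)
    {
        var groups = new Dictionary<string, List<byte[]>>();
        using (var md5 = MD5.Create())
        {
            foreach (var record in records)
            {
                string key = BitConverter.ToString(md5.ComputeHash(record));
                if (!groups.TryGetValue(key, out var bucket))
                    groups[key] = bucket = new List<byte[]>();
                bucket.Add(record);
            }
        }
        return groups;
    }

    // Only buckets with more than one record need the raw comparison,
    // and that comparison stops at the first differing byte.
    public static IEnumerable<List<byte[]>> CollidingGroups(Dictionary<string, List<byte[]>> groups)
        => groups.Values.Where(bucket => bucket.Count > 1);
}
```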
Upvotes: 1
Reputation: 1064
You can use MD5 and then, if you encounter colliding records, check them with SHA256 or even SHA-1.
Upvotes: 1
Reputation: 13491
How about doing something different?
Use a simple hashing function on each record, like one you would use when inserting the record into a hash table, perhaps mapping each record to a 32-bit int. Then, if there is a hash collision, compare the colliding records directly for uniqueness.
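A sketch of that approach for a data set too large to hold in memory: map each 32-bit hash to the offsets of records already seen, and only re-read and byte-compare records whose hashes collide. The computeHash32 and readRecordAt delegates are hypothetical stand-ins for whatever simple hash and record storage you already have:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DedupByHash
{
    // hash -> offsets of records already seen with that hash
    private static readonly Dictionary<uint, List<long>> Seen = new Dictionary<uint, List<long>>();

    public static bool IsDuplicate(long offset, byte[] record,
                                   Func<long, byte[]> readRecordAt,   // hypothetical storage accessor
                                   Func<byte[], uint> computeHash32)  // any simple 32-bit hash
    {
        uint hash = computeHash32(record);
        if (Seen.TryGetValue(hash, out var offsets))
        {
            // Hash collision: compare against each earlier record with the same hash.
            foreach (long earlier in offsets)
            {
                if (record.SequenceEqual(readRecordAt(earlier)))
                    return true;               // exact duplicate found
            }
            offsets.Add(offset);               // same hash, different content
            return false;
        }
        Seen.Add(hash, new List<long> { offset });
        return false;
    }
}
```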
Upvotes: 3