Reputation: 15141
Description
I have a rather large set of unique (string, string, string) tuples (around 40 million, and the set can grow). For each tuple I compute a single unsigned int value. I would like to store those values somewhere so that, once generated, they can be reused even after the application goes down (so in-memory storage is out of the question, and unfortunately so are databases).
At first I stored them in a file as tuples (string, string, string, value), but reading in 40 million records takes time (and I need the values almost instantly).
I decided to first compute the hash value of each (string, string, string) tuple, then normalize it to [0, n] (where n is the number of values), and store only the values in a binary file, sorted by the normalized hash value. After that I can simply mmap() this file and get a value with mmap[normalize(hash(string, string, string))].
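A minimal sketch of that lookup, assuming a POSIX system and a file holding exactly n contiguous unsigned int slots (the names are illustrative and error handling is omitted):

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

unsigned int lookupValue(const char *path, std::size_t n, std::size_t index) {
    int fd = open(path, O_RDONLY);
    // Map the whole file of n unsigned ints read-only.
    void *base = mmap(nullptr, n * sizeof(unsigned int), PROT_READ, MAP_SHARED, fd, 0);
    const unsigned int *values = static_cast<const unsigned int *>(base);
    unsigned int value = values[index];  // index = normalize(hash(s1, s2, s3))
    munmap(base, n * sizeof(unsigned int));
    close(fd);
    return value;
}

In practice the file would be mapped once at startup and kept mapped, rather than per lookup.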
My hash function is pretty straightforward but fast and works in my case (I didn't notice any collisions):
std::string concatenatedString = s1 + "." + s2 + "." + s3;

unsigned int hash = 31;
for (std::size_t i = 0; i < concatenatedString.length(); i++) {
    // Multiply-and-add over the characters; overflow simply wraps modulo 2^32.
    hash = hash * 101 + (unsigned int) concatenatedString[i];
}
Same with normalization (straightforward):
((long) n * hash) / max_value
n - the upper bound of my normalized range (around 40mln; I use n rather than (n - lower_bound) because lower_bound = 0)
max_value - the maximum value of the old range (UINT_MAX in my case; min_value = 0, so I do not include it in the equation)
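For clarity, that scaling step done in explicitly 64-bit arithmetic might look like this (a sketch; uint64_t is used instead of long so the intermediate product cannot overflow on 32-bit platforms either):

#include <climits>
#include <cstdint>

// Scale a 32-bit hash from [0, UINT_MAX] down to [0, n] (sketch).
unsigned int normalizeHash(unsigned int hash, unsigned int n) {
    return static_cast<unsigned int>(
        (static_cast<std::uint64_t>(n) * hash) / UINT_MAX);
}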
Problem
My hash function doesn't produce uniformly distributed values in the range 0 to 4,294,967,295 (unsigned int), and I don't see how it even could. Because of this, after normalization I get quite a few collisions, which result in data loss (values being overwritten under the same array index).
Are there any clever ways to do what I want to do but without those collisions?
I am fully aware that some collisions may occur. The thing is, with my approach they occur far too often. My hashing range is 100 times bigger than the number of my elements, so I'm guessing there should be a way to do this, but I haven't yet figured out how.
Solution
In the end I changed my hash to MurmurHash, changed my normalizing method to a simple "modulo newRange", and changed the format of the file (I now store all the data: (string, string, string, value)). The file is pretty big now, but thanks to that I was able to implement a simple collision detection mechanism (double hashing).
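A rough sketch of that final scheme, assuming the reference MurmurHash3_x86_32 from the public smhasher implementation (the seed and helper name are illustrative, and the probing/double-hashing step on a verification failure is left out):

#include <cstdint>
#include <string>
#include "MurmurHash3.h"  // reference implementation from the smhasher repository (assumed available)

// MurmurHash3 + "modulo newRange" as the slot index. Storing the full
// (s1, s2, s3, value) record per slot lets a lookup verify the key and
// detect collisions before trusting the stored value.
std::uint32_t slotFor(const std::string &s1, const std::string &s2,
                      const std::string &s3, std::uint32_t newRange) {
    const std::string key = s1 + "." + s2 + "." + s3;
    std::uint32_t h = 0;
    MurmurHash3_x86_32(key.data(), static_cast<int>(key.size()), /*seed=*/42, &h);
    return h % newRange;
}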
Upvotes: 3
Views: 1985
Reputation: 145389
You're saying that you're using this for normalization:
((unsigned int) n * hash) / max_value
and you say that max_value is UINT_MAX:
"max_value - maximum value of the old set (UINT_MAX)"
And hash is declared as unsigned int.
Well, you know, then the above can only produce the values 0 and 1, which guarantees collisions.
Are you aware of the difference between integer and floating point division in C++?
If not, then I suggest getting a C++ textbook.
By the way, casts like "(unsigned int) blah" are a sure way to create bugs. They tell the compiler to shut up, to not tell you about possible problems, because you're telling it you know better. But you don't.
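To illustrate the point about all-unsigned-int arithmetic (a standalone example, not the OP's exact code):

#include <climits>
#include <cstdint>
#include <iostream>

int main() {
    unsigned int hash = 3000000000u;  // some 32-bit hash value
    unsigned int n = 40000000u;       // target range

    // The product n * hash wraps around modulo 2^32, so dividing the wrapped
    // result by UINT_MAX can only ever give 0 or 1.
    unsigned int bad = (n * hash) / UINT_MAX;

    // Doing the multiplication in 64 bits keeps the full product, and the
    // quotient lands somewhere in [0, n] as intended.
    std::uint64_t good = (static_cast<std::uint64_t>(n) * hash) / UINT_MAX;

    std::cout << bad << " vs " << good << '\n';  // 0 (or 1) vs an index in [0, n]
}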
Upvotes: 1
Reputation: 17928
As far as I understand, you need a unique hash (which is actually impossible :) ):
In Java String.hashCode() gives you a 32 bit hash code.
If you want (say) a 64-bit hashcode you can easily implement it yourself.
If you want a cryptographic hash of a String, the Java crypto libraries include implementations of MD5, SHA-1 and so on. You'll typically need to turn the String into a byte array, and then feed that to the hash generator / digest generator. For example, see @Boris Pavlović's answer.
If you want a unique hash code, you are out of luck. Hashes and hash codes are non-unique.
A Java String of length N has 65536 ^ N possible states, and requires an integer with 16 * N bits to represent all possible values. If you write a hash function that produces an integer with a smaller range (e.g. fewer than 16 * N bits), you will eventually find cases where more than one String hashes to the same integer; i.e. the hash codes cannot be unique. This is called the Pigeonhole Principle, and there is a straightforward mathematical proof. (You can't fight math and win!)
Upvotes: 0
Reputation: 1144
I'm actually surprised that you are not getting collisions before you normalize the range of hash values. It looks like you are using an unnormalized range of [0, 2^32). Looking at the Birthday Problem chart here, the probability of a collision with 4*10^7 elements should be higher than 75%. In any case, normalizing the hash outputs to a range equal to the size of the set of elements practically guarantees a non-trivial number of collisions. Unless you're willing to use a counter for your hash values, I don't see how you'll be able to avoid that.
EDIT: Saw your edit. Even with a range 100 times the number of elements (which is about 4*10^9), you are still likely to get a lot of collisions. As indicated above, the probability of one or more collisions is well over 75%.
There are two things I would suggest:
Choose a Different Hash Function
As you noted, while your hash function is fast, it will not distribute values randomly in the range [0,2^32). There are several hash functions out there that are both fast and do a better job distributing hash values across the hash function range. One that I've used in the past is the MurmurHash.
Use a Larger Range
Using a larger range should reduce the risk of collisions. Looking at the chart here again, 64 bits should be enough to bring the risk of a collision down to a negligible level (on the order of 10^-5 for 40 million elements). The MurmurHash64A and MurmurHash64B variants will be useful in this case.
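As a rough sanity check on those figures, the standard birthday approximation p ≈ 1 - exp(-k^2 / 2N) can be evaluated directly (a sketch; approximate values, not exact chart figures):

#include <cmath>
#include <cstdio>

// Birthday-problem approximation: probability of at least one collision
// when k items are hashed uniformly into a space of N possible values.
double collisionProbability(double k, double N) {
    return 1.0 - std::exp(-(k * k) / (2.0 * N));
}

int main() {
    double k = 4e7;  // ~40 million tuples
    // Roughly 1.0 for the 32-bit range, around 4e-5 for the 64-bit range.
    std::printf("32-bit range: %f\n", collisionProbability(k, std::pow(2.0, 32)));
    std::printf("64-bit range: %e\n", collisionProbability(k, std::pow(2.0, 64)));
}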
Upvotes: 4
Reputation: 2857
It is not always possible to normalize hashes to unique [0..n] values.
I can suggest two approaches:
Upvotes: 1