Reputation: 4452
I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array). What hashing algorithm can i use to ensure that the hash value generated for every string is unique?
And this led to a bigger question.. How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?
I want to know the mathematical construct involved behind the implementation of such an algorithm.
Upvotes: 0
Views: 5434
Reputation: 2143
I think what you are looking for is Substring Index or String search. Am i missing something?
Upvotes: 0
Reputation: 2084
Source codes worth a thousand words...
String.java, look at the hashCode() method: http://www.google.com/codesearch/p?hl=zh-TW#ih5hvYJNSIA/src/share/classes/java/lang/String.java&q=String.java%20hashcode&sa=N&cd=1&ct=rc
HashMap.java, look at the put() method: http://www.google.com/codesearch/p?hl=zh-TW#ih5hvYJNSIA/src/share/classes/java/util/HashMap.java&q=hashMap.java%20%22V%20put%22&sa=N&cd=1&ct=rc
Upvotes: 0
Reputation: 93167
You can't be 100% sure, a hash by definition can have collisions.
You can see on grepcode how a String
is hashed in java. And basically HashMap
(and other hash based structures) use the hashCode()
method every time.
So if you want to count the number of iterations of a particular word, you should use a Map<String, Integer>
(in java) and count from there.
For example :
Map<String, Integer> words = new HashMap<String, Integer>();
String word = "lol";
Integer count = words.get(word);
if(count == null){
count = 0;
}
words.put(word, count + 1);
Upvotes: 1
Reputation: 241641
What hashing algorithm can i use to ensure that the hash value generated for every string is unique?
There is no such function. The space of strings is infinite, but the target space is finite (say you are using 32-bit integers). You can not injectively map an infinite space to a finite space; there must be collisions.
How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?
They don't; there is not a unique hashing function for strings per the above.
I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array).
You have the right idea. Just use a dictionary mapping string
s to int
. For example, in C# we would use a Dictionary<string, int>
. Something similar to this exists in most modern languages. Let the language/framework handle the issue of collisions and what not for you and just focus on expressing your idea within that language/framework.
Upvotes: 7
Reputation: 133577
Hashed cannot be a one-to-one function that provides an unique output for every input simply because, usually, the codomain of the function is smaller than the domain, so what you are asking is not possible.
Of course if the length of the string is limited and the set of all possible strings is lower than a precise bound you can have what's called perfect hash function.
You can just search for a good hashing function that has a low collision probability, just start from here and have fun!
side note: if I'm not wrong Java Hashtable
doesn't use open addressing. Whenever a collision is found, the element is placed in the same, already occupied, cell through a list. So it's definitely the opposite of what you think.. implmentations doesn't try to guarantee uniqueness, they instead choose a good collision resolution strategy that minimizes some aspects
Upvotes: 2
Reputation: 19707
In Java, hashCode for String
is implemented as follows:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
Source: JavaDoc for java.lang.String
You might want to consider using a similar algorithm to make your hashCode bullet proof (mostly).
Upvotes: 0
Reputation: 30481
Theoretically speaking, you cannot guarantee uniqueness for hashes - unless the length of your hash is always as long or longer as the original strings, which is kind of counterproductive.
For a comprehensive explanation on this, please see "Are Hash Codes Unique?" by Tom Archer.
Upvotes: 1
Reputation: 272497
You can't have a hashing algorithm that guarantees uniqueness; it's the pigeonhole principle. Why not use a binary tree?
Upvotes: 3