Raj
Raj

Reputation: 4452

hashing algorithm for strings

I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array). What hashing algorithm can i use to ensure that the hash value generated for every string is unique?

And this led to a bigger question.. How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?

I want to know the mathematical construct involved behind the implementation of such an algorithm.

Upvotes: 0

Views: 5434

Answers (8)

yadab
yadab

Reputation: 2143

I think what you are looking for is Substring Index or String search. Am i missing something?

Upvotes: 0

Colin Hebert
Colin Hebert

Reputation: 93167

You can't be 100% sure, a hash by definition can have collisions.

You can see on grepcode how a String is hashed in java. And basically HashMap (and other hash based structures) use the hashCode() method every time.

So if you want to count the number of iterations of a particular word, you should use a Map<String, Integer> (in java) and count from there.

For example :

Map<String, Integer> words = new HashMap<String, Integer>();
String word = "lol";

Integer count = words.get(word);
if(count == null){
    count = 0;
}
words.put(word, count + 1);

Upvotes: 1

jason
jason

Reputation: 241641

What hashing algorithm can i use to ensure that the hash value generated for every string is unique?

There is no such function. The space of strings is infinite, but the target space is finite (say you are using 32-bit integers). You can not injectively map an infinite space to a finite space; there must be collisions.

How do the language libraries (Java for example) implement data structures like hashmap that generate unique hash values in case of strings?

They don't; there is not a unique hashing function for strings per the above.

I came across a situation where i had to count the number of occurences of each word in a string. I decided hashing would be the best way to do it (Find the hash value for each word that is encountered and increment the count at the position indexed by the hash value - assuming i use an array).

You have the right idea. Just use a dictionary mapping strings to int. For example, in C# we would use a Dictionary<string, int>. Something similar to this exists in most modern languages. Let the language/framework handle the issue of collisions and what not for you and just focus on expressing your idea within that language/framework.

Upvotes: 7

Jack
Jack

Reputation: 133577

Hashed cannot be a one-to-one function that provides an unique output for every input simply because, usually, the codomain of the function is smaller than the domain, so what you are asking is not possible.

Of course if the length of the string is limited and the set of all possible strings is lower than a precise bound you can have what's called perfect hash function.

You can just search for a good hashing function that has a low collision probability, just start from here and have fun!

side note: if I'm not wrong Java Hashtable doesn't use open addressing. Whenever a collision is found, the element is placed in the same, already occupied, cell through a list. So it's definitely the opposite of what you think.. implmentations doesn't try to guarantee uniqueness, they instead choose a good collision resolution strategy that minimizes some aspects

Upvotes: 2

Mike
Mike

Reputation: 19707

In Java, hashCode for String is implemented as follows:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

Source: JavaDoc for java.lang.String

You might want to consider using a similar algorithm to make your hashCode bullet proof (mostly).

Upvotes: 0

jsalonen
jsalonen

Reputation: 30481

Theoretically speaking, you cannot guarantee uniqueness for hashes - unless the length of your hash is always as long or longer as the original strings, which is kind of counterproductive.

For a comprehensive explanation on this, please see "Are Hash Codes Unique?" by Tom Archer.

Upvotes: 1

Oliver Charlesworth
Oliver Charlesworth

Reputation: 272497

You can't have a hashing algorithm that guarantees uniqueness; it's the pigeonhole principle. Why not use a binary tree?

Upvotes: 3

Related Questions