user1411135
user1411135

Reputation: 31

Hashcode for strings that can be converted to integer

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.

I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.

My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.

What I'm thinking is:

Any other ideas are greatly appreciated.

thanks a lot

Thanks everyone for your promptly replies...

There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!

I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.

I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.

Upvotes: 2

Views: 2763

Answers (4)

Philippe Marschall
Philippe Marschall

Reputation: 4604

«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»

Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:

  1. HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
  2. HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.

Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.

If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

Upvotes: 0

sjr
sjr

Reputation: 9875

If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.

If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).

One problem is that insertion order is not preserved. Maybe you don't care.

Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?

Upvotes: 0

user949300
user949300

Reputation: 15729

Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.

I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.

Upvotes: 1

pamphlet
pamphlet

Reputation: 2104

You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.

Edit:

Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".

Upvotes: 1

Related Questions