Cratylus

Reputation: 54074

How can I evaluate a hash table implementation? (Using HashMap as reference)

Problem:

I need to evaluate a hash table implementation, using java.util.HashMap as the reference.

What I did so far:

Originally I created my own custom "benchmark" with loops and many calls to hint the GC, to get a feel for the difference, but I am reading online that using a standard tool is more reliable/appropriate.
Example of my approach (HashMapInterface is just a wrapper so I can switch among implementations):

Integer[] keys = new Integer[10000000];
String[] values = new String[10000000];
for(int i = 0; i < keys.length; ++i) {
    keys[i] = i;
    values[i] = "" + i;
}

if(operation.equals("put")) {
    runOperation(map, keys, values);
}

public static long[] runOperation(HashMapInterface<String> map, Integer[] keys, String[] values) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long run = 0;
    for(int i = 0; i < 10; ++i) {
        long start = System.currentTimeMillis();
        for(int j = 0; j < keys.length; ++j) {
            map.put(keys[j], values[j]);
        }
        long total = System.currentTimeMillis() - start;
        System.out.println(total/1000d + " seconds");
        if(total < min) {
            min = total;
        }
        if(total > max) {
            max = total;
        }
        run += total;
        map = null;
        map = createNewHashMap();
        hintsToGC();
    }
    return new long[] {min, max, run};
}


public static void hintsToGC() {
    for(int i = 0; i < 20; ++i) {
        System.out.print(". ");
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}


private static HashMapInterface<String> createNewHashMap() {
    if(jdk) {
        return new JDKHashMapWrapper();
    }
    else {
        return new AlternativeHashMapWrapper();
    }
}



public class JDKHashMapWrapper implements HashMapInterface<String> {
    HashMap<Integer, String> hashMap;

    JDKHashMapWrapper() {
        hashMap = new HashMap<Integer, String>();
    }

    public String put(Integer key, String value) {
        return hashMap.put(key, value);
    }
    //etc
}

(I want to test put, get, and containsKey, as well as memory utilization.)
Can I be sure that my approach gives reasonable measurements?
If not, what would be the most appropriate tool to use, and how?

Update:
- I also test with random keys (~10M of them) generated with SecureRandom; see the sketch below.
- Whenever the hash table resizes, I print the ratio of its logical size to the size of the actual table, to track the load factor.
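A sketch of what that random-key generation could look like (SecureRandom.nextInt() can return duplicates, which may matter when comparing put against get):

// requires: import java.security.SecureRandom;
SecureRandom random = new SecureRandom();
Integer[] randomKeys = new Integer[10000000];
String[] randomValues = new String[10000000];
for(int i = 0; i < randomKeys.length; ++i) {
    randomKeys[i] = random.nextInt();   // may produce duplicate keys
    randomValues[i] = "" + randomKeys[i];
}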

Update:
For my specific case, where I am also interested in integers, what kind of pitfalls are there with my approach?

UPDATE after @dimo414's comments:

Well at a minimum the hashtable as a "whole" isn't meaningful

I mean how the hash table behaves under various loads, both at runtime and in memory consumption.

Every data structure is a tradeoff of different methods

I agree. My trade-off is an acceptable access penalty in exchange for a memory improvement.

You need to identify what features you're interested in verifying

1) put(key, value);
2) get(key);
3) containsKey(key);
4) all of the above with many entries in the hash table (a sketch of a matching get/containsKey timing loop follows below)
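For reference, a hypothetical companion to runOperation above that times lookups over a pre-populated map (it assumes the wrapper also exposes get and containsKey):

public static long runGetOperation(HashMapInterface<String> map, Integer[] keys) {
    long start = System.currentTimeMillis();
    long hits = 0;
    for(int i = 0; i < keys.length; ++i) {
        // Use the results so the JIT cannot eliminate the calls entirely.
        if(map.containsKey(keys[i]) && map.get(keys[i]) != null) {
            ++hits;
        }
    }
    System.out.println(hits + " hits");
    return System.currentTimeMillis() - start;
}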

Upvotes: 5

Views: 771

Answers (3)

Ivan Senic

Reputation: 649

As I understand it, you are interested in both the execution time of the operations and the memory consumption of the maps under test.

I will start with memory consumption, as this seems not to have been addressed at all. What I propose is to use a small library called Classmexer. I personally used it when I needed to get the 100% correct memory consumption of an object. It uses the Java agent approach (it relies on the Instrumentation API), which means you need to add it as a parameter to the JVM executing your tests:

-javaagent:[PATH_TO]/classmexer.jar

Using Classmexer is very simple. At any point in time you can get the memory consumption in bytes by executing:

MemoryUtil.deepMemoryUsageOf(mapIamInterestedIn, VisibilityFilter.ALL)

Note that with the visibility filter you can specify whether the memory calculation should cover only the object itself (our map) or also every other object reachable through references. That's what VisibilityFilter.ALL is for. However, this means that the size you get back includes all the objects you used for keys and values: if you have 100 Integer/String entries, the reported size includes those as well.
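A sketch of how that could look in a test (assuming the agent is attached and MemoryUtil and VisibilityFilter from the Classmexer package are imported):

Map<Integer, String> map = new HashMap<Integer, String>();
for(int i = 0; i < 1000000; i++) {
    map.put(i, "" + i);
}
// Deep size: the map itself plus every reachable key and value object.
long bytes = MemoryUtil.deepMemoryUsageOf(map, VisibilityFilter.ALL);
System.out.println(bytes + " bytes (" + (bytes / (1024 * 1024)) + " MB)");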

For the timing aspect I would propose JMH, as this tool is made for micro-benchmarking. There are plenty of examples online; for example, this article has map-testing examples that can guide you pretty well.
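A minimal JMH benchmark sketch for the put case (the class and parameter names here are made up for illustration, and it assumes the jmh-core and jmh-generator-annprocess dependencies are on the classpath):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
public class MapPutBenchmark {

    @Param({"10000", "1000000", "10000000"})
    int size;

    Integer[] keys;
    String[] values;

    @Setup(Level.Trial)
    public void prepare() {
        keys = new Integer[size];
        values = new String[size];
        for(int i = 0; i < size; i++) {
            keys[i] = i;
            values[i] = "" + i;
        }
    }

    @Benchmark
    public Map<Integer, String> put() {
        // Fresh map per invocation, so each measurement inserts `size` entries.
        Map<Integer, String> map = new HashMap<Integer, String>();
        for(int i = 0; i < size; i++) {
            map.put(keys[i], values[i]);
        }
        return map; // returning the map defeats dead-code elimination
    }
}

JMH handles warmup iterations, forking, and statistics for you, which replaces the hand-rolled min/max/total loop from the question.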

Note that you should be careful about when you call Classmexer's MemoryUtil, as it will interfere with the timing results if you call it during the time measurement. Furthermore, I am sure there are many other tools similar to Classmexer, but I like it because it is small and simple.

Upvotes: 1

schtever

Reputation: 3250

Some key considerations when using hash tables are the size of the "buckets" allocation, the collision resolution strategy, and the shape of your data. Essentially, a hash table takes the key supplied by the application and hashes it to a bucket index less than the number of allocated buckets. When two key values hash to the same bucket, the implementation has to resolve the collision and still return the right value. For example, one could keep a sorted linked list for each bucket and search that list.
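To make that concrete, here is a hypothetical, stripped-down table that chains colliding entries in one (unsorted) linked list per bucket; real implementations such as java.util.HashMap are considerably more sophisticated:

import java.util.LinkedList;

public class SimpleChainedMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public SimpleChainedMap(int bucketCount) {
        buckets = new LinkedList[bucketCount];
        for(int i = 0; i < bucketCount; i++) {
            buckets[i] = new LinkedList<Entry<K, V>>();
        }
    }

    // Map an arbitrary hash code to a non-negative bucket index.
    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    public void put(K key, V value) {
        for(Entry<K, V> e : buckets[indexFor(key)]) {
            if(e.key.equals(key)) { e.value = value; return; }   // same key: overwrite
        }
        buckets[indexFor(key)].add(new Entry<K, V>(key, value)); // empty bucket or collision: append
    }

    public V get(K key) {
        for(Entry<K, V> e : buckets[indexFor(key)]) {            // walk the chain
            if(e.key.equals(key)) return e.value;
        }
        return null;
    }
}

The fewer the buckets relative to the number of entries, the longer these chains get, which is exactly the time/memory trade-off described below.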

If your data happens to have a lot of collisions, your performance will suffer, because the hash table implementation will spend too much time resolving them. On the other hand, if you have a very large number of buckets, you solve the collision problem at the expense of memory. Also, Java's built-in HashMap implementation will "rehash" if the number of entries grows beyond a certain threshold; I imagine this is an expensive operation that is worth avoiding.

Since your key data is the integers from 0 to 10M, your test data looks good. I would also ensure that the different hash table implementations are initialized to the same bucket size for a given test, otherwise it's not a fair comparison; a sketch follows. Finally, I would vary the bucket size over a pretty significant range and rerun the tests to see how the implementations change their behavior.
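With java.util.HashMap, for example, both the initial bucket count and the rehash threshold can be pinned via the two-argument constructor:

// Pre-size the map so no rehash occurs while inserting numKeys entries:
// the table grows once size > capacity * loadFactor.
int numKeys = 10000000;
float loadFactor = 0.75f;
int initialCapacity = (int) Math.ceil(numKeys / loadFactor);
HashMap<Integer, String> map = new HashMap<Integer, String>(initialCapacity, loadFactor);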

Upvotes: 1

adamconkey

Reputation: 4745

I was just doing something similar to this, and I ended up using the built-in profiler in the NetBeans IDE. You can get really detailed info on both CPU and memory usage. I had originally written all my code in Eclipse, but NetBeans has an import feature for bringing in Eclipse projects, and it set everything up no problem, if that is possibly your situation too.

For timing, you might also look at the StopWatch class in Apache Commons Lang. It's a much more intuitive way of tracking time on targeted operations, e.g.:

// requires: import org.apache.commons.lang3.time.StopWatch;
StopWatch myMapTimer = new StopWatch();
HashMap<Integer, Integer> hashMap = new HashMap<>();

myMapTimer.start();
for(int i = 0; i < numElements; i++) {
    hashMap.put(i, i);
}
myMapTimer.stop();

System.out.println(myMapTimer.getTime()); // time will be in milliseconds

Upvotes: 0
