Vikash Balasubramanian

Reputation: 3233

Deciding when to use a hash table

I was solving a competitive programming problem with the following requirements:

I had to maintain a list of unique 2D points (x, y); the number of unique points would be less than 500.

My idea was to store them in a hash table (a C++ unordered_set, to be specific): each time a point turned up, I would look it up in the table and insert it if it was not already there.

I also know for a fact that I wouldn't be doing more than 500 lookups. I saw some solutions simply searching through an (unsorted) array and checking whether the point was already there before inserting it.

My question is: is there any reasonable way to guess when I should use a hash table over a manual search through the keys, without having to benchmark them?
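For reference, here is a minimal sketch of the two approaches I am comparing (not taken from any of the solutions; it assumes the coordinates fit in 32-bit ints, and packs them into one 64-bit key because std::unordered_set has no built-in hash for std::pair):

    #include <algorithm>
    #include <cstdint>
    #include <unordered_set>
    #include <utility>
    #include <vector>

    // Hash-table approach: pack both coordinates into one 64-bit key
    // (assumes the coordinates fit in 32-bit ints).
    static std::uint64_t packPoint(int x, int y) {
        return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(x)) << 32)
             | static_cast<std::uint32_t>(y);
    }

    std::unordered_set<std::uint64_t> seen;

    bool insertIfNewHash(int x, int y) {
        return seen.insert(packPoint(x, y)).second;  // true if the point was new
    }

    // Linear-scan approach: keep the points in a plain vector and search it.
    std::vector<std::pair<int, int>> points;

    bool insertIfNewScan(int x, int y) {
        std::pair<int, int> p{x, y};
        if (std::find(points.begin(), points.end(), p) != points.end())
            return false;                            // already present
        points.push_back(p);
        return true;
    }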

Upvotes: 1

Views: 963

Answers (3)

My question is: is there any reasonable way to guess when I should use a hash table over a manual search through the keys, without having to benchmark them?

I am guessing you are familiar with basic algorithmics and time complexity, and with the C++ standard containers, and you know that, with luck, hash table access is O(1).

If the hash table code (or some balanced tree code, e.g. using std::map - assuming there is an easy order on keys) is more readable, I would prefer it for that readability reason alone.

Otherwise, you might make an educated guess, taking into account the approximate timings of various operations on a PC. BTW, the entire http://norvig.com/21-days.html page is worth reading.

Basically, memory accesses are much slower than everything else in the CPU. The CPU cache is extremely important. A typical memory access with a cache miss, requiring data to be fetched from the DRAM modules, is several hundred times slower than an elementary arithmetic operation or machine instruction (e.g. adding two integers in registers).

In practice, it does not matter that much as long as your data is tiny (e.g. fewer than a thousand elements), since in that case it is likely to sit in the L2 cache.

Searching (linearly) through an array is really fast (since it is very cache friendly), up to several thousand (small) elements.

IIRC, Herb Sutter mentions in some video that even inserting an element into a vector is practically, but unintuitively, faster (taking into account the time needed to shift the existing elements) than inserting it into some balanced tree (or perhaps some other container, e.g. a hash table), up to a container size of several thousand small elements. This is on a typical tablet, desktop or server microprocessor with a multi-megabyte cache. YMMV.
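As an illustration of that point (my own sketch, not Herb Sutter's code): keeping a small vector sorted and inserting with std::lower_bound stays cache friendly and also gives you O(log n) lookups.

    #include <algorithm>
    #include <utility>
    #include <vector>

    // Keep a sorted vector of points; insert only if the point is not present.
    // vector::insert shifts the tail, which stays cheap at a few thousand
    // small elements, and lookups are O(log n) via binary search.
    bool insertSorted(std::vector<std::pair<int, int>>& v, std::pair<int, int> p) {
        auto it = std::lower_bound(v.begin(), v.end(), p);
        if (it != v.end() && *it == p)
            return false;          // already stored
        v.insert(it, p);
        return true;
    }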

If you really care that much, you cannot avoid benchmarking.
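If you do end up benchmarking, a minimal (admittedly naive) timing sketch could look like the following; the integer keys, sizes and single-run timing are just assumptions for illustration, and a serious benchmark would repeat the work and guard against the compiler optimizing it away.

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <unordered_set>
    #include <vector>

    int main() {
        const int N = 500;                            // assumed problem size
        std::vector<int> keys(N);
        for (int i = 0; i < N; ++i)
            keys[i] = i * 7919 % 100000;              // arbitrary test data

        auto t0 = std::chrono::steady_clock::now();
        std::vector<int> v;
        for (int k : keys)
            if (std::find(v.begin(), v.end(), k) == v.end())
                v.push_back(k);
        auto t1 = std::chrono::steady_clock::now();

        std::unordered_set<int> s;
        for (int k : keys)
            s.insert(k);
        auto t2 = std::chrono::steady_clock::now();

        using us = std::chrono::microseconds;
        std::printf("vector scan: %lld us, unordered_set: %lld us\n",
                    (long long)std::chrono::duration_cast<us>(t1 - t0).count(),
                    (long long)std::chrono::duration_cast<us>(t2 - t1).count());
    }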

Notice that 500 pairs of integers will probably fit into the L1 cache!

Upvotes: 3

CrossLuna

Reputation: 31

My rule of thumb is to assume the processor can deal with 10^9 operations per second.

In your case there are only 500 entries, so even an O(N^2) algorithm is safe. By using a contiguous data structure like a vector you can also benefit from cache hits, and a hash function can sometimes be costly in terms of its constant factor. However, if you had a data size of 10^6, the safe total complexity might only be O(N); in that case you would need to consider an O(1) hash map lookup for each query.
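To make that concrete with the numbers from the question: roughly 500 insertions, each preceded by a linear scan of at most 500 elements, is on the order of 500 * 500 = 250,000 comparisons, i.e. well under a millisecond at ~10^9 simple operations per second. With 10^6 elements the same quadratic approach would need on the order of 10^12 operations, which is where the hash table's O(1) lookups start to pay off.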

Upvotes: 3

CShark

Reputation: 1563

You can use big-O complexity to roughly estimate the performance. For a hash table, searching for an element is O(1) on average and O(n) in the worst case. That means that in the typical case your access time is independent of the number of elements in your map, but in the worst case it is linearly dependent on the size of your hash table.

A balanced binary search tree has a guaranteed search complexity of O(log n). That means that searching for an element always depends on the size of the container, but in the worst case it is faster than a hash table.
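For what it's worth, in C++ the tree-based option is std::set, and unlike std::unordered_set it works for std::pair out of the box; here is a small illustrative sketch (not from the question's code):

    #include <set>
    #include <utility>

    // std::set is typically a red-black tree: O(log n) search and insert,
    // and std::pair already provides the ordering it needs.
    std::set<std::pair<int, int>> points;

    bool insertIfNew(int x, int y) {
        return points.insert({x, y}).second;  // true if the point was not there yet
    }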

You can look up some common big-O complexities at this handy website: http://bigocheatsheet.com/

Upvotes: 0
