Data structure to store 2D points clustered near the origin?

Question

I need to use a spatial 2d map for my application. The map usually contains small amount of values in (-200, -200) - (200, 200) rectangle, most of them around (0, 0).
I thought of using hash map but then I need a hash function. I thought of x * 200 + y but then adding (0, 0) and (1, 0) will require 800 bytes for the hash table only, and memory is a problem in my application.
The map is immutable after initial setup so insertion time isn't a matter, but there is a lot of access (about 600 a second) and the target CPU isn't really fast.
What are the general memory/access time trade-offs between hash map and ordinary map(I believe RB-Tree in stl) in small areas? what is a good hash function for small areas?

templatetypedef · Accepted Answer

I think that there are a few things that I need to explain in a bit more detail to answer your question.

For starters, there is a strong distinction between a hash function as its typically used in a program and the number of buckets used in a hash table. In most implementations of a hash function, the hash function is some mapping from objects to integers. The hash table is then free to pick any number of buckets it wants, then maps back from the integers to those buckets. Commonly, this is done by taking the hash code and then modding it by the number of buckets. This means that if you want to store points in a hash table, you don't need to worry about how large the values that your hash function produces are. For example, if the hash table has only three buckets and you produce objects with hash codes 0 and 1,000,000,000, then the first object would hash to the zeroth bucket and the second object would hash to the 1,000,000,000 % 3 = 1st bucket. You wouldn't need 1,000,000,000 buckets. Consequently, you shouldn't worry about picking a hash function like x * 200 + y, since unless your hash table is implemented very oddly you don't need to worry about space usage.

If you are creating a hash table in a way where you will be inserting only once and then spending a lot of time doing accesses, you may want to look into perfect hash functions and perfect hash tables. These are data structures that work by trying to find a hash function for the set of points that you're storing such that no collisions ever occur. They take (expected) O(n) time to create, and can do lookups in worst-case O(1) time. Barring the overhead from computing the hash function, this is the fastest way to look up points in space.

If you were just to dump everything in a tree-based map like most implementations of std::map, though, you should be perfectly fine. With at most 400x400 = 160,000 points, the time required to look up a point would be about lg 160,000 ≈ 18 lookups. This is unlikely to be a bottleneck in any application, though if you really need all the performance you can get the aforementioned perfect hash table is likely to be the best option.

However, both of these solutions only work if the queries you are interested in are of the form "does point p exist in the set or not?" If you want to do more complex geometric queries like nearest-neighbor lookups or finding all the points in a bounding box, you may want to look into more complex data structures like the k-d tree, which supports extremely fast (O(log n)) lookups and fast nearest-neighbor and range searches.

Hope this helps!

Data structure to store 2D points clustered near the origin?

Answers (2)

Related Questions