Reputation: 33
I am looking for the fastest way to search a sorted, fixed array of 32-bit keys. The array's size and data are static and will never change; it holds ~1000-10000 unique elements. The search range is significantly broader (~100000), so many searched values will not be found. I am interested in exact matches only.
A potentially interesting property of the keys is that even when they are not close in terms of integer value, most of them differ by only a few bits (~1-4) from their closest neighbor.
Most answers I found point towards binary search, but none deal with the case of a static array, which probably opens up some optimization possibilities.
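For reference, the binary-search baseline can be sketched with the standard library (Java here, to match the answer's code below; the sample keys are made up):

```java
import java.util.Arrays;

public class Baseline {
    public static void main(String[] args) {
        // Hypothetical sorted, static set of 32-bit keys.
        int[] keys = {3, 17, 42, 256, 1024, 65536};

        // Arrays.binarySearch returns a non-negative index on an exact match,
        // and a negative insertion-point encoding otherwise.
        System.out.println(Arrays.binarySearch(keys, 256) >= 0); // true
        System.out.println(Arrays.binarySearch(keys, 5) >= 0);   // false
    }
}
```

This is the O(log n) baseline any static-array scheme has to beat.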
I have full control over the data structure; right now it is a fixed, sorted array, but I could change that if it's not optimal. Since the data doesn't change, I could also add precomputed information, as long as it doesn't take an unreasonable amount of memory.
The goal is to be efficient in both CPU and memory, although CPU is the priority here.
I'm using C++, although that probably won't affect the answer much.
Upvotes: 3
Views: 464
Reputation: 3484
Considering that your static arrays never change, and that you have effectively unlimited pre-processing power, I think the best approach would be to create a specific hash function for each of your arrays.
My approach: define a parameterized hash function (code in Java):
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

private static Function<Long, Integer> createHashFunction(int sz) {
    // Six randomly chosen shift amounts; every call produces a new candidate hash.
    int mvLeft   = ThreadLocalRandom.current().nextInt(30);
    int mvRight  = ThreadLocalRandom.current().nextInt(16);
    int mvLeft2  = ThreadLocalRandom.current().nextInt(10);
    int mvRight2 = ThreadLocalRandom.current().nextInt(16);
    int mvLeft3  = ThreadLocalRandom.current().nextInt(16);
    int mvRight3 = ThreadLocalRandom.current().nextInt(20);
    return (key) -> {
        // These operations are totally random and have no mathematical background beneath them!
        key = ~key + (key << mvLeft);
        key = key ^ (key >>> mvRight);
        key = key + (key << mvLeft2);
        key = key ^ (key >>> mvRight2);
        key = key + (key << mvLeft3);
        key = key ^ (key >>> mvRight3);
        // floorMod rather than Math.abs(key) % sz: Math.abs(Long.MIN_VALUE) is still
        // negative, which would yield a negative bucket index.
        return (int) Math.floorMod(key, sz); // sz is the size of the target array
    };
}
For each target array, search for a combination of parameters for which the maximum bucket size is smallest.
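That parameter search could be sketched like this (a self-contained, trimmed variant of createHashFunction above; the key count, table size, and trial budget are assumptions):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

public class FindBestHash {
    // Trimmed version of the answer's createHashFunction, so this sketch compiles alone.
    static Function<Long, Integer> createHashFunction(int sz) {
        int mvLeft  = ThreadLocalRandom.current().nextInt(1, 30); // 1+ avoids degenerate shifts
        int mvRight = ThreadLocalRandom.current().nextInt(1, 16);
        return key -> {
            key = ~key + (key << mvLeft);
            key = key ^ (key >>> mvRight);
            return (int) Math.floorMod(key, sz);
        };
    }

    // Largest bucket produced when hashing every key into sz buckets.
    static int maxBucketSize(long[] keys, Function<Long, Integer> h, int sz) {
        int[] counts = new int[sz];
        int max = 0;
        for (long k : keys) max = Math.max(max, ++counts[h.apply(k)]);
        return max;
    }

    public static void main(String[] args) {
        // Hypothetical data set: 1000 random 32-bit keys.
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        long[] keys = new long[1000];
        for (int i = 0; i < keys.length; i++) keys[i] = rnd.nextInt() & 0xFFFFFFFFL;

        int sz = 2 * keys.length;           // assumed table size: load factor 0.5
        int bestMax = Integer.MAX_VALUE;
        Function<Long, Integer> bestHash = null;
        for (int trial = 0; trial < 500; trial++) { // pre-processing budget is "infinite"
            Function<Long, Integer> h = createHashFunction(sz);
            int m = maxBucketSize(keys, h, sz);
            if (m < bestMax) { bestMax = m; bestHash = h; }
        }
        // With enough trials, the flattest candidate should have small buckets.
        System.out.println(bestMax <= 5);
    }
}
```

The retained bestHash (and its parameters) is what you would bake in at build time for that particular array.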
Some testing (input array of size 10k, filled with random elements):
Considering that with a maximum bucket size of 2 it is possible to pack both values into one 64-bit integer, this approach costs only one memory access, and the hashing itself uses only xor, add, and shift, which are extremely cheap operations for the CPU, as is the final bit comparison.
However, your data may not be so well-behaved and may require a bucket size of 3, which destroys the possibility of using a long long for bucket items. In that case you can try to find a decent hash function instead of the random mess I've written.
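The two-keys-in-one-long layout might look like this (a minimal sketch; the sentinel and sample values are assumptions):

```java
public class PackedBucket {
    static final int EMPTY = -1; // assumed sentinel value not present among the keys

    // Pack two 32-bit keys into one 64-bit slot: one in the high half, one in the low half.
    static long pack(int a, int b) {
        return ((long) a << 32) | (b & 0xFFFFFFFFL);
    }

    // One 64-bit load plus two comparisons per probe; no branching into a chain.
    static boolean bucketContains(long slot, int key) {
        return (int) (slot >>> 32) == key || (int) slot == key;
    }

    public static void main(String[] args) {
        long slot = pack(42, 7);
        System.out.println(bucketContains(slot, 42)); // true
        System.out.println(bucketContains(slot, 7));  // true
        System.out.println(bucketContains(slot, 8));  // false
    }
}
```

A lookup is then: hash the query key, load the one long for that bucket, and compare both halves.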
Upvotes: 3