Reputation: 11479
Once again, I have a problem for which I would like to shave off nanoseconds. I have a small, constant array and I would like to search it to see if a given number is a member*.
Input: A 64-bit number n.
Output: True if n is in the array, false if n is not.
What are good techniques for making binary searches fast, given the ability to optimize for the specific elements and their distribution?
I have an array with around 136 members (though see below: there's some flexibility) to search. The members are not equally distributed through the range: they cluster toward the beginning and end of the range. The input numbers can be assumed to be chosen with uniform probability. It's probably worthwhile to take advantage of this irregularity.
Here's a sample picture of the distribution for the 136-element array. Note that only 12 of the 136 elements are between 1% and 99% of the range; the balance are below 1% or over 99%.
[image: distribution plot for the 136-element array] (source: crg4.com)
I assume that branch misprediction will be the largest cost of any implementation. I'd be happy to be proved wrong.
* Actually, I have two arrays. Actually actually, I have a choice of what arrays to use: efficiency suggests that the first should have perhaps 10-40 members, while the second can have no more than (exactly) 136 members. My problem gives real flexibility in selecting sizes, along with limited freedom to decide precisely which members to use. If a method performs better with certain sizes or restrictions, please mention this because I may be able to use it. All things equal, I'd prefer to have the second array as large as possible. For reasons unrelated to the binary search I may need to reduce the size of the second array to <= 135 or <= 66 (this is related to the difficulty of determining the input number, which depends on the array selected).
Here's one of the possible arrays, if it helps in testing ideas. (This pretty well reveals my purpose...!) Don't jump to unwarranted conclusions on the basis of the first few members, though.
0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040, 1346269, 2178309, 3524578, 5702887, 9227465, 14930352, 24157817, 39088169, 63245986, 102334155, 165580141, 267914296, 433494437, 701408733, 1134903170, 1836311903, 2971215073, 4807526976, 7778742049, 12586269025, 20365011074, 32951280099, 53316291173, 86267571272, 139583862445, 225851433717, 365435296162, 591286729879, 956722026041, 1548008755920, 2504730781961, 4052739537881, 6557470319842, 10610209857723, 17167680177565, 27777890035288, 44945570212853, 72723460248141, 117669030460994, 190392490709135, 308061521170129, 498454011879264, 806515533049393, 1304969544928657, 2111485077978050, 3416454622906707, 5527939700884757, 8944394323791464, 14472334024676221, 23416728348467685, 37889062373143906, 61305790721611591, 99194853094755497, 160500643816367088, 259695496911122585, 420196140727489673, 679891637638612258, 1100087778366101931, 1779979416004714189, 2880067194370816120, 4660046610375530309, 7540113804746346429, 9320093220751060618, 9999984858389672876, 10259680355300795461, 10358875208395550958, 10396764270768694864, 10411236604793371085, 10416764544494255842, 10418876029572233892, 10419682545105283285, 10419990606626453414, 10420108275656914408, 10420153221227127261, 10420170388907304826, 10420176946377624668, 10420179451108406629, 10420180407830432670, 10420180773265728832, 10420180912849591277, 10420180966165882450, 10420180986530893524, 10420180994309635573, 10420180997280850646, 10420180998415753816, 10420180998849248253, 10420180999014828394, 10420180999078074380, 10420180999102232197, 10420180999111459662, 10420180999114984240, 10420180999116330509, 10420180999116844738, 10420180999117041156, 10420180999117116181, 10420180999117144838, 10420180999117155784, 10420180999117159965, 10420180999117161562, 10420180999117162172, 10420180999117162405, 10420180999117162494, 10420180999117162528, 10420180999117162541, 10420180999117162546, 10420180999117162548
I will initially run the program on a Phenom II x4, but optimizations for other architectures are welcome.
Upvotes: 5
Views: 1932
Reputation:
An interesting alternative is to use a modified (bidirectional) version of linear search.
The pseudocode is easier to follow than the code:
if the value is greater than or equal to 2^63:
    linear search from the middle to the end
otherwise (the value is less than 2^63):
    linear search from the middle to the beginning (must go in the reverse direction)
My code is pretty hacky, but you can make it look more elegant:
int in_set(unsigned long long value)
{
    const unsigned long long max_bit_mask = (1ULL << 63); // must be 1ULL: (1 << 63) overflows int
    if (value & max_bit_mask) // if the top bit is set, linear search forward from the middle (50% probability)
    {
        const unsigned long long *ullp = data + 91; // first dereference reads data + 92 due to the prefix increment
        while (*++ullp < value); // FIXME: array must be capped with ULLONG_MAX on the right
        return *ullp == value;   // && ullp != right_sentinel
    }
    // otherwise use reverse linear search from the middle
    const unsigned long long *ullp = data + 92; // first dereference reads data + 91 due to the prefix decrement
    while (*--ullp > value); // WARNING: array must be capped with zero on the left
    return *ullp == value;   // && ullp != left_sentinel
}
For uniformly random inputs against this particular Fibonacci-distributed array, the code is an O(1) algorithm on average (O(k) in the worst case, where k is the table size).
You can use Steve Jessop's lookup table strategy to cache, for each 8-bit signature, the index of the element closest to the center of the array, but it might make the code slower, since you are adding an extra O(1) time factor and extra branch misprediction.
const int lookup_table[256] = { /* ... */ };
const unsigned long long *ullp = data + lookup_table[value >> 56];
// if, while, and returns as before
One implementation issue with this strategy is that you have to pad the array on the left and right with the minimum and maximum values (here 0 and ULLONG_MAX), or errors will eventually occur. Zero isn't a problem, since it is in the set, but ULLONG_MAX is not, so additional logic is required: the sentinel values must be used and properly accounted for.
edit: the time complexity is O(1) in the average case because each advance of the cursor has a ~38% (1/phi^2) chance of terminating the loop; the expected cost is the sum of n/phi^(n+1) for n >= 1, which equals 1/(phi-1)^2 = phi^2 (~2.6 cursor advances).
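Spelling that expectation out (a short derivation; only the standard geometric-series identity sum_{n>=1} n x^n = x/(1-x)^2 is assumed):

\[
\sum_{n\ge 1} \frac{n}{\varphi^{n+1}}
= \frac{1}{\varphi}\sum_{n\ge 1} \frac{n}{\varphi^{n}}
= \frac{1}{\varphi}\cdot\frac{1/\varphi}{(1-1/\varphi)^{2}}
= \frac{1}{(\varphi-1)^{2}}
= \varphi^{2} \approx 2.618
\]

(the last step uses phi - 1 = 1/phi).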
edit2: this code should incur far fewer branch mispredictions than a binary search, and should be faster on average as long as queries are uniformly distributed over the 64-bit range.
Upvotes: 0
Reputation: 43426
Since you know the array data at compile time, you might consider using a hash instead of a binary search. A carefully chosen hash could be faster, especially if you can find a simple hash function that is collision-free for your data, and fits within your memory constraints.
Edit: to further explain...
You hash the 64-bit value to a smaller value, which is an index into an array. In your case, you would aim to have a collision-free hash, so your table would simply store, at each index: a) for hits, the one valid value that hashes to that index, or b) for misses, an invalid value.
You pick the hash function that suits your purposes. In your case, the main parameters are:
- how cheap the hash function is to compute, and
- the size of the resulting table (i.e. the range of the hash output), which trades memory for a better chance of being collision-free.
Assuming no collisions, you use it at run-time as follows:
- hash the input value to get an index,
- read the stored value at that index,
- report a hit if it equals the input, and a miss otherwise.
If your hash function produces collisions, your choices are:
- look for a different hash function (or a larger table) that is collision-free for your data, or
- store more than one candidate value per index and check each of them, at the cost of extra comparisons.
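A minimal sketch of how the run-time lookup could look, assuming a hypothetical multiply-shift hash whose multiplier has been found offline to be collision-free for the 136 values; the constant and table size below are illustrative, not verified against the actual array:

#include <stdint.h>

#define TABLE_BITS 9                    /* 512 slots for 136 values; size is illustrative */
#define TABLE_SIZE (1u << TABLE_BITS)

/* table[] is built offline: table[hash(v)] = v for every member v.  Every
   remaining slot holds a copy of some member that does NOT hash to that
   slot, so a miss can never be reported as a hit. */
extern const uint64_t table[TABLE_SIZE];

static inline unsigned hash(uint64_t v)
{
    /* hypothetical multiply-shift hash; the multiplier would be found by an
       offline search for a constant that is collision-free on the 136 values */
    return (unsigned)((v * 0x9E3779B97F4A7C15ull) >> (64 - TABLE_BITS));
}

int in_set(uint64_t v)
{
    return table[hash(v)] == v; /* one multiply, one load, one compare; no branches */
}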
Upvotes: 3
Reputation:
Since there are only 136 members, on a newish PC I would do a brute-force search using 128-bit instructions and prefetching.
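A sketch of that brute force with SSE intrinsics (note that _mm_cmpeq_epi64 is SSE4.1, which the Phenom II lacks, so this really is for a newish PC; the data array and its layout are assumptions):

#include <smmintrin.h> /* SSE4.1, for _mm_cmpeq_epi64 */
#include <stdint.h>

extern const uint64_t data[136]; /* the table; 136 is conveniently a multiple of 2 */

int in_set(uint64_t value)
{
    const __m128i needle = _mm_set1_epi64x((long long)value);
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < 136; i += 2) {
        /* compare two members per iteration and OR the results together,
           so the loop body contains no data-dependent branch */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(data + i));
        acc = _mm_or_si128(acc, _mm_cmpeq_epi64(chunk, needle));
    }
    return _mm_movemask_epi8(acc) != 0; /* any matching lane sets bits */
}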
If there were thousands of members, I would echo Steve Jessop's idea of a 256-entry lookup table above, but with a function f(x) applied to the lookup value whose purpose is to evenly distribute the members into 256 buckets.
f(x) could be some kind of polynomial, similar to "smoothstep" from the world of graphics but perhaps not as smooth.
Upvotes: 0
Reputation: 47954
A normal binary search would iterate at most log_2(n) times. Each iteration would typically have three comparisons (Are we done? Is the number higher? Is it lower?). That's three chances to fail branch prediction on each iteration.
If you unrolled the binary search (which is feasible because your array is so small and the values are known ahead of time), you could eliminate the "are we done?" comparisons, and your typical case would go from 3*log_2(n) comparisons to 2*log_2(n). That means fewer instructions executed and fewer chances for a missed branch prediction. But it also means more total instructions (and thus less cache-friendliness). You'd have to profile to see if this helps on balance.
You could write a quick program to generate the unrolled search function rather than unrolling it by hand.
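For illustration, the generated function might look like this for a hypothetical 8-element array {2, 3, 5, 7, 11, 13, 17, 19} (placeholder values; the real generator would emit the same shape for the 136-element table):

/* fully unrolled binary search; no loop, no "are we done?" test */
int in_set(unsigned long long n)
{
    if (n < 7) {
        if (n < 3)  return n == 2;
        if (n < 5)  return n == 3;
        return n == 5;
    }
    if (n < 13) {
        if (n < 11) return n == 7;
        return n == 11;
    }
    if (n < 17)     return n == 13;
    if (n < 19)     return n == 17;
    return n == 19;
}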
Perhaps profile-guided optimization on the unrolled search could help further by taking advantage of the uneven distribution.
Upvotes: 1
Reputation: 279225
As a very simple possible optimization, create a 256-entry lookup table for the most significant 8 bits of your 64-bit value. Each entry of the table stores the indexes, in the actual array, of the lower and upper bounds of the values with those most significant 8 bits. You only need to search this region of the array.
If your array values were uniformly distributed, all the regions would be about the same length, and this wouldn't provide much gain (if any); it's not much different from an interpolation search. Since your values are so skewed, most of the 256 entries will point to very short regions (near the middle) which are fast to binary search, or even 0-sized regions. The 2 or 3 entries at each end will point to much larger regions of the array, which will then take relatively longer to search through (almost as long as a binary search of the whole array). Since your inputs are uniformly distributed, the average time spent searching will be reduced, and hopefully this reduction is greater than the cost of the initial lookup. Your worst case might well end up slower, though.
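A sketch of the idea, assuming hypothetical names (regions[] would be built offline from the sorted array):

#include <stdint.h>

typedef struct { uint8_t lo, hi; } range_t;  /* hi is exclusive; 136 fits in a byte */
extern const uint64_t data[136];             /* sorted */
extern const range_t regions[256];           /* built offline: regions[b] bounds the
                                                values whose top byte is b */

int in_set(uint64_t n)
{
    range_t r = regions[n >> 56];            /* most significant 8 bits pick the region */
    unsigned lo = r.lo, hi = r.hi;
    while (lo < hi) {                        /* binary search only within the region */
        unsigned mid = (lo + hi) / 2;
        if (data[mid] < n) lo = mid + 1;
        else               hi = mid;
    }
    return lo < 136 && data[lo] == n;        /* empty regions fall through to false */
}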
To refine this, you might have a 2-level lookup table on 4 bits at a time. The first level either says "search between these indices", or else "look up the next 4 significant bits in this second-level table". The former is fine for middling values, where 16 times the value-range still corresponds to a very small index-range, and so is still quick to search. The latter would be for the ends of the range where the search space is larger. Total size of tables would be smaller, which may or may not give better performance due to better caching of less data. The tables themselves could be generated at runtime, or at compile-time if you're willing to generate C code once the array values are known. You could even code the lookup table as a giant switch-statement from hell, just to see if it speeds things up or not.
If you haven't already, you should also benchmark an interpolation search rather than a simple binary chop once you start searching in the array.
Note that I've worked to reduce the number of comparisons made in the binary search, rather than specifically the number of branch mispredictions. The two are sort of proportional anyway - you can't avoid that each time you halve the possibilities in a binary search, you'll get a misprediction in something like 50% of cases. If you really wanted to minimize mispredictions, then a linear search guarantees only one misprediction per lookup (the one that breaks the loop). That ain't faster in general, but you could experiment to see whether there's a size for the remaining array to be searched, below which you should switch to a linear search, perhaps unrolled, perhaps fully unrolled. There may be some other much cleverer hybrid linear/binary search that can be tuned for the relative cost of a successful vs. unsuccessful comparison, but if so I don't know it.
Upvotes: 2
Reputation: 272467
If all you're interested in is member/not-member, rather than location, you could eliminate some conditional branches with the following arrangement:
bool b = false;
b |= (n == x[i]);     // each test compiles to a compare and an OR: no conditional branch
b |= (n == x[i+1]);
// ... etc. ...
Clearly, you probably don't want to do this for all 136 entries. But there may be a sweet spot: use a coarse-grained binary search to first locate which batch of e.g. 4 elements n could be in, and then switch to the above approach, as in the sketch below.
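A sketch of that hybrid, assuming a sorted array x[136] split into 34 batches of 4 (batch size and names are illustrative):

#include <stdbool.h>
#include <stdint.h>

extern const uint64_t x[136];                  /* sorted; 136 = 34 batches of 4 */

bool in_set(uint64_t n)
{
    /* coarse binary search over batch maxima to find the first batch that
       could contain n */
    unsigned lo = 0, hi = 34;
    while (lo < hi) {
        unsigned mid = (lo + hi) / 2;
        if (x[mid * 4 + 3] < n) lo = mid + 1;  /* n is beyond this whole batch */
        else                    hi = mid;
    }
    if (lo == 34) return false;                /* n exceeds the largest member */

    /* branch-free membership test within the chosen batch */
    const uint64_t *p = x + lo * 4;
    bool b = false;
    b |= (n == p[0]);
    b |= (n == p[1]);
    b |= (n == p[2]);
    b |= (n == p[3]);
    return b;
}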
Upvotes: 3
Reputation: 23550
Interesting problem. At first I thought this was one of the many problems that can be handled most efficiently by a red-black tree mapping ascending values to indices.
However, you have pointed out that the distribution is highly non-uniform. Approaching this from a human perspective first: what would a human do? He or she would probably first check whether the given value is below the 0.01 quantile, then whether it is above the 0.99 quantile, and with this strategy alone would already have narrowed the search space by 49/50ths after only 2 comparisons.
Further iterations in the 0.01-0.99 range are rare (these numbers are the 0...1 mapping of, e.g., the 64-bit value space).
Further iterations in the 0.0-0.01 and 0.99-1.0 ranges don't need to go deep, because they start out already close to the correct value.
So, how can we generalize this? We don't need to. You have noticed that the 0.0-0.01 and 0.99-1.0 parts of the value space are frequent; but this is probably not the case in the second iteration, and definitely not in the third (you will not find the same distribution inside the 0.0-0.01 range of values as you did over the complete range).
I would do it like this: a jump table with 3 targets, depending on which of the 3 regions the value lies in, then a red-black tree for each of the regions.
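A sketch of the dispatch, with a plain binary search standing in for the per-region red-black trees (the quantile boundaries and split indices are illustrative assumptions, not computed from the actual array):

#include <stdint.h>

extern const uint64_t data[136];   /* sorted */
#define LOW_END    92              /* illustrative: members below ~1% of the range  */
#define HIGH_BEGIN 104             /* illustrative: members above ~99% of the range */

static int search(const uint64_t *a, unsigned len, uint64_t n)
{
    unsigned lo = 0, hi = len;     /* stand-in for the per-region red-black tree */
    while (lo < hi) {
        unsigned mid = (lo + hi) / 2;
        if (a[mid] < n) lo = mid + 1;
        else            hi = mid;
    }
    return lo < len && a[lo] == n;
}

int in_set(uint64_t n)
{
    const uint64_t q01 = UINT64_MAX / 100;   /* ~0.01 quantile of the value space */
    const uint64_t q99 = UINT64_MAX - q01;   /* ~0.99 quantile */
    if (n < q01)  return search(data, LOW_END, n);                       /* dense low end  */
    if (n > q99)  return search(data + HIGH_BEGIN, 136 - HIGH_BEGIN, n); /* dense high end */
    return search(data + LOW_END, HIGH_BEGIN - LOW_END, n);              /* sparse middle  */
}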
Upvotes: 0