Reputation: 11810
I am trying to optimize a search through a very short sorted array of doubles to locate the bucket a given value
belongs to. Assuming the array has 8 doubles, I have come up with the following sequence of AVX intrinsics:
#include <immintrin.h>
__m256d _data, _value = _mm256_set1_pd(value); /* search key broadcast to all 4 lanes */
int temp, pos;
_data = _mm256_load_pd(array);                 /* first 4 doubles; array must be 32-byte aligned */
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);                    /* count elements < value in the low half */
_data = _mm256_load_pd(array+4);               /* last 4 doubles */
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);                   /* pos = total count of elements < value */
To my surprise (I do not keep instruction latency tables in my head...), it turned out that gcc generates faster code for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the value is uniformly distributed, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in a general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; ++i) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using the additional rcx register for addressing in vmovsd.
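For reference, one way to request the unrolling explicitly is a per-loop pragma; a sketch, assuming gcc 8 or later, which added this pragma (older versions have to rely on -funroll-loops):

#pragma GCC unroll 7   /* ask gcc to unroll up to 7 iterations of the loop below */
for(i=0; i<n-1; ++i) if(array[i+1]>=value) break;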
I can manually modify the assembly for the loop to look like the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I cannot seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do it other than with inline asm? Or even better - can you suggest a faster implementation of the search?
Upvotes: 4
Views: 376
Reputation: 753
You can try integer comparison. For IEEE-754 doubles, comparison is equivalent to int64_t comparison of the same bits as long as both values are non-negative and neither is NaN (negative doubles sort in reverse order when compared as two's-complement integers). It could turn out faster: the CPU has more integer execution units than SIMD ones. Just pass the double* in and reinterpret it as int64_t* inside the function.
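A minimal sketch of the idea, assuming all array values are non-negative and free of NaNs (find_bucket_int and as_int64 are illustrative names, not from the original post):

#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as int64_t without violating aliasing rules. */
static inline int64_t as_int64(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}

/* For non-negative IEEE-754 doubles the int64_t bit patterns sort
 * in the same order as the values themselves. */
int find_bucket_int(const double *array, int n, double value)
{
    int64_t key = as_int64(value);
    int i;
    for(i=0; i<n-1; ++i)
        if(as_int64(array[i+1]) >= key) break;
    return i;
}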
Upvotes: 0
Reputation: 7970
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results. I tested on Windows 7 x64, not Linux, but the generated code was very similar. How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to warm up the CPU and activate turbo.
4) I called Sleep(0) to avoid a context switch.
5) I called __rdtscp to start the measurement.
6) In a loop I called either the AVX find-index function or the simple C version, like you did; the other implementation was commented out and not used. The loop size was 10M calls.
7) I called __rdtscp again to finish the measurement.
8) I printed ticks/iterations to get the average tick count per call.
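A minimal sketch of this harness (MSVC/Windows; find_index is a placeholder for whichever implementation is being measured, not a name from the original post):

#include <intrin.h>    /* __rdtscp */
#include <windows.h>   /* Sleep, priority APIs */
#include <stdio.h>

extern int find_index(const double *array, double value); /* function under test */

int main(void)
{
    static const double A[8] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
    const long long iters = 10000000;      /* 10M calls */
    unsigned int aux;
    volatile int sink = 0;                 /* keep the calls from being optimized away */

    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    for (long long k = 0; k < iters; ++k)  /* warm-up: let the CPU reach turbo */
        sink += find_index(A, 0.3);

    Sleep(0);                              /* yield once to avoid a mid-run context switch */
    unsigned long long t0 = __rdtscp(&aux);
    for (long long k = 0; k < iters; ++k)
        sink += find_index(A, 0.3);
    unsigned long long t1 = __rdtscp(&aux);

    printf("avg ticks/iter: %.2f\n", (double)(t1 - t0) / (double)iters);
    return (int)sink;
}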
Note: I declared both 'find index' functions as inline and confirmed in the disassembly that they got inlined. The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a one-based index. On my system, the AVX function took 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower. You can see that the AVX function works in constant time, while the trivial C loop's performance depends on the index you are looking for.
Timing with clock() over 100M iterations yields similar results: AVX is almost 4x faster for the first test. Also note that longer runs reveal somewhat different numbers, but AVX holds a similar advantage every time.
Upvotes: 1