Reputation: 11810
I am trying to optimize a search through a very short sorted array of doubles to locate the bucket a given value
belongs to. Assuming the array has 8 doubles, I have come up with the following sequence of AVX intrinsics:
#include <immintrin.h>
__m256d _data, _value = _mm256_set1_pd(value); /* search key broadcast to all 4 lanes */
int temp, pos;
_data = _mm256_load_pd(array);                 /* first 4 doubles; array must be 32-byte aligned */
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos = _mm_popcnt_u32(temp);                    /* count elements < value in the low half */
_data = _mm256_load_pd(array+4);               /* last 4 doubles */
temp = _mm256_movemask_pd(_mm256_cmp_pd(_data, _value, _CMP_LT_OQ));
pos += _mm_popcnt_u32(temp);                   /* pos = total count of elements < value */
To my surprise (I do not keep instruction latency tables in my head...), it turned out that gcc generates faster code for the following C loop:
for(i=0; i<7; ++i) if(array[i+1]>=value) break;
This loop compiles into what I found to be very efficient code:
lea ecx, [rax+1]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L7
lea ecx, [rax+2]
vmovsd xmm1, QWORD PTR [rdx+rcx*8]
vucomisd xmm1, xmm0
jae .L8
[... repeat for all elements of array]
so it takes 4 instructions to check 1 bucket (lea, vmovsd, vucomisd, jae). Assuming the value is uniformly distributed, on average I will have to check ~3.5 buckets per value. Apparently, this is enough to outperform the AVX code listed earlier.
Now, in a general case the array may of course be larger than 8 elements. If I code a C loop like this:
for(i=0; i<n-1; ++i) if(array[i+1]>=value) break;
I get the following instruction sequence for the loop body:
.L76:
mov eax, edx
.L67:
cmp eax, esi
jae .L77
lea edx, [rax+1]
mov ecx, edx
vmovsd xmm1, QWORD PTR [rdi+rcx*8]
vucomisd xmm1, xmm0
jb .L76
I can tell gcc to unroll the loop, but the point is that the number of instructions per element is larger than in the case of the loop with constant bounds, and the code is slower. Also, I do not understand the reason behind using the additional rcx register for addressing in vmovsd.
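For reference, one way to request the unrolling explicitly is a per-loop pragma; a sketch, assuming gcc 8 or later, which added this pragma (older versions have to rely on -funroll-loops):

#pragma GCC unroll 7   /* ask gcc to unroll up to 7 iterations of the loop below */
for(i=0; i<n-1; ++i) if(array[i+1]>=value) break;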
I can manually modify the assembly for the loop to look like the first example, and it does work faster:
.L76:
cmp edx, esi # eax -> edx
jae .L77
lea edx, [rdx+1] # rax -> rdx
vmovsd xmm1, QWORD PTR [rdi+rdx*8]
vucomisd xmm1, xmm0
jb .L76
but I cannot seem to make gcc do it. And I know it can - the asm generated in the first example is OK.
Do you have any ideas how to do it other than with inline asm? Or even better - can you suggest a faster implementation of the search?
Upvotes: 4
Views: 376
Reputation: 753
You can try integer comparison. For IEEE-754 doubles, comparison is equivalent to int64_t comparison of the same bits as long as both values are non-negative and neither is NaN (negative doubles sort in reverse order when compared as two's-complement integers). It could turn out faster: the CPU has more integer execution units than SIMD ones. Just pass the double* in and reinterpret it as int64_t* inside the function.
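A minimal sketch of the idea, assuming all array values are non-negative and free of NaNs (find_bucket_int and as_int64 are illustrative names, not from the original post):

#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as int64_t without violating aliasing rules. */
static inline int64_t as_int64(double d)
{
    int64_t i;
    memcpy(&i, &d, sizeof i);
    return i;
}

/* For non-negative IEEE-754 doubles the int64_t bit patterns sort
 * in the same order as the values themselves. */
int find_bucket_int(const double *array, int n, double value)
{
    int64_t key = as_int64(value);
    int i;
    for(i=0; i<n-1; ++i)
        if(as_int64(array[i+1]) >= key) break;
    return i;
}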
Upvotes: 0
Reputation: 7970
Not really an answer, but there's no room in the comments for this.
I tested the AVX function against a simple C implementation and got completely different results. I tested on Windows 7 x64, not Linux, but the generated code was very similar. How the test went:
1) I disabled the CPU's SpeedStep.
2) Within main() I raised the process priority and thread priority to the max (realtime).
3) I ran 10M calls to the tested function to warm up the CPU and activate turbo.
4) I called Sleep(0) to avoid a context switch.
5) I called __rdtscp to start the measurement.
6) In a loop I called either the AVX find-index function or the simple C version, like you did; the other implementation was commented out and not used. The loop size was 10M calls.
7) I called __rdtscp again to finish the measurement.
8) I printed ticks/iterations to get the average tick count per call.
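A minimal sketch of this harness (MSVC/Windows; find_index is a placeholder for whichever implementation is being measured, not a name from the original post):

#include <intrin.h>    /* __rdtscp */
#include <windows.h>   /* Sleep, priority APIs */
#include <stdio.h>

extern int find_index(const double *array, double value); /* function under test */

int main(void)
{
    static const double A[8] = { 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
    const long long iters = 10000000;      /* 10M calls */
    unsigned int aux;
    volatile int sink = 0;                 /* keep the calls from being optimized away */

    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    for (long long k = 0; k < iters; ++k)  /* warm-up: let the CPU reach turbo */
        sink += find_index(A, 0.3);

    Sleep(0);                              /* yield once to avoid a mid-run context switch */
    unsigned long long t0 = __rdtscp(&aux);
    for (long long k = 0; k < iters; ++k)
        sink += find_index(A, 0.3);
    unsigned long long t1 = __rdtscp(&aux);

    printf("avg ticks/iter: %.2f\n", (double)(t1 - t0) / (double)iters);
    return (int)sink;
}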
Note: I declared both 'find index' functions as inline and confirmed in the disassembly that they got inlined. The AVX function and the C function you described are not identical: the C function returns a zero-based index and the AVX function returns a one-based index. On my system, the AVX function took 1.1 cycles per iteration and the C function took 4.4 cycles per iteration.
I couldn't force the MSVC compiler to use more than ymm registers :(
Array used:
double A[8] = {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 };
Results (avg. ticks/iter):
value = 0.3 (index = 2): AVX: 1.1 | C: 4.4
value = 0.5 (index = 3): AVX: 1.1 | C: 11.1
value = 0.9 (index = 7): AVX: 1.1 | C: 18.1
If the AVX function is corrected to return pos-1, then it will be 50% slower. You can see that the AVX function works in constant time, while the trivial C loop's performance depends on the index you are looking for.
Timing with clock() over 100M iterations yields similar results: AVX is almost 4x faster for the first test. Also note that longer runs reveal somewhat different numbers, but AVX holds a similar advantage every time.
Upvotes: 1