Reputation: 21551
In the article Linear vs. Binary Search, there is a fast implementation of binary search which uses the CMOV instruction. I would like to implement this in VC++ as the application I'm working on relies on the performance of Binary Search.
The implementation has some GCC inline assembler which is declared as follows:
static int binary_cmov (const int *arr, int n, int key) {
int min = 0, max = n;
while (min < max) {
int middle = (min + max) >> 1;
asm ("cmpl %3, %2\n\tcmovg %4, %0\n\tcmovle %5, %1"
: "+r" (min),
"+r" (max)
: "r" (key), "g" (arr [middle]),
"g" (middle + 1), "g" (middle));
// Equivalent to
// if (key > arr [middle])
// min = middle + 1;
// else
// max = middle;
}
return min;
}
I'd like to convert this GCC Assembler to Microsoft Visual Studio compatible assembler, but as a GCC noob wouldn't know where to start.
Can anyone help, or at least explain the GCC Inline Assembler? TIA!
Upvotes: 4
Views: 1199
Reputation: 23
With VS2015 update 3, i could convince the compiler to use cmova and cmovbe by changing the second comparison from > to <=:
int binary_cmov(const int *arr, int n, int key)
{
int min = 0, max = n;
while (min < max)
// cmp r9,r8
// jb binary_cmov+1B0h
{
int middle = (min + max) >> 1;
// lea rdx,[r8+r9]
// shr rdx,1
int middle1 = middle + 1;
// lea rax,[rdx+4]
min = key > arr[middle] ? middle1 : min;
// cmp r10,dword ptr [rdi+rdx*4]
// cmova r9,rax
max = key <= arr[middle] ? middle : max;
// cmovbe r8,rdx
}
return min;
}
Measuring performance on an i7-5600U with a real application (lookup of breakpoint addresses), linear search out performs binary search up to arrays of 512 entries. I think speed is limited by filling the cache and the binary search > 512 entries mainly performs better because it avoids getting a part of the table from the memory.
The code for the linear pointer search with sentinel is:
// the array must be terminated with a value that is larger than
// any value searched for e.g. MAX_INT (the sentinel)
// the index of the equal or greater entry is returned
int find_in_terminated_array(const int *arr, int key)
{
int * p = arr;
while(key > *p) +p;
return p - arr;
}
Upvotes: 0
Reputation: 39621
Making a small change to Richard Hodges' code also allows Visual Studio 2013 to use CMOV instruction:
int binary_cmov (const int *arr, int n, int key)
{
int min = 0, max = n;
while (min < max)
{
int middle = (min + max) >> 1;
int middle1 = middle + 1;
min = key > arr[middle] ? middle1 : min;
max = key > arr[middle] ? max : middle;
}
return min;
}
Compiling with cl /Ox /Fa /c t286.c
generates:
; Listing generated by Microsoft (R) Optimizing Compiler Version 18.00.40629.0
binary_cmov PROC
mov QWORD PTR [rsp+8], rbx
; Line 3
xor eax, eax
mov ebx, r8d
mov r11d, edx
; Line 4
test edx, edx
jle SHORT $LN9@binary_cmo
$LL2@binary_cmo:
; Line 6
lea r10d, DWORD PTR [r11+rax]
sar r10d, 1
; Line 8
movsxd r8, r10d
lea r9d, DWORD PTR [r10+1]
cmp ebx, DWORD PTR [rcx+r8*4]
; Line 9
cmovg r10d, r11d
cmovg eax, r9d
mov r11d, r10d
cmp eax, r10d
jl SHORT $LL2@binary_cmo
$LN9@binary_cmo:
; Line 12
mov rbx, QWORD PTR [rsp+8]
ret 0
binary_cmov ENDP
This also works with the Visual Studio 2015 and 2010 compilers. With Visual Studio the trick seems to be to use the ternary operators and use simple variables as the second and third operands. If you replace middle1
with middle + 1
in the first ternary operator above then Visual Studio only generates one CMOV instruction for this function. The first ternary operator generates a branch instead.
As mentioned by Yakk in a comment, the forthcoming Visual Stdio 2015 Update 3 compiler contains a major update to the optimizer that should change this behaviour, making it more likely to generate CMOV instructions where appropriate.
Upvotes: 4
Reputation: 7528
People have already pointed out (repeatedly) the fact that you should avoid inline asm if possible. I agree with all of them.
That said, you also asked "at least explain the GCC Inline Assembler." That part may be moot, but FWIW:
Starting with the assembler template:
"cmpl %3, %2\n\tcmovg %4, %0\n\tcmovle %5, %1"
There are a couple things going on here that might look weird to a VS programmer.
cmpl
- By default, gcc's asm uses at&t syntax instead of intel's. This results in a number of things being subtly different. For example, the l
here indicates that the comparison is to be applied to longs (where long in this context means 4 bytes). One of the other differences is that operands are reversed. So intel might use mov eax, 1
, but att would use movl $1, %eax
(dollar sign indicates constant, % marks a register). There are sites that talk about all the differences here.\n\t
- gcc is going to output the code to the assembler. To put each of these 3 instruction on their own lines, you add '\n' for newline and '\t' for a tab to line things up pretty.printf("2 + 2 = %d\n", 2+2);
). In the same way, gcc will replace parts of the template with the following parameters. %0 is the first parameter and %5 is the 6th (numbered in the order they appear).That brings us to the rest of the command. Items that come after the first colon are output parameters:
: "+r" (min),
"+r" (max)
The '+' here indicates that the variables are both read and written. If they were output only, you would use '='. If you were going to read but not modify, you'd put it as a input parameter (ie after the next colon). The 'r' indicates that the values must be moved into registers before executing the asm. It leaves the decision about which register to the compiler. The programmer doesn't need to know; he can use use %0 for min
and %1 for max
and the token will be replace as appropriate.
Which brings us to the output parameters. Pretty much what you expect except that 'g' is a general constraint type. In theory it allows the values to be stored in a register, memory, or an immediate value. Mostly it just means register.
: "r" (key), "g" (arr [middle]),
"g" (middle + 1), "g" (middle));
There are pages of docs about gcc's inline asm (see https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html) that describe all this in painful detail. There are also a few pages that talk about the constraints other than 'r' and 'g' (https://gcc.gnu.org/onlinedocs/gcc/Constraints.html).
As you might imagine, it is incredibly easy to get this wrong. And if you do, you probably won't get a nice, comprehensible compiler error. Instead what will happen is that it will appear to work, then something else a dozen lines later will fail for no obvious reason.
Which is why friends don't tell friends to use inline assembler.
Hey, you asked...
Upvotes: 3
Reputation: 69902
refactoring the code in order to better express intent to the compiler:
int binary_cmov (const int *arr, int n, int key)
{
int min = 0, max = n;
while (min < max)
{
int middle = (min + max) >> 1;
min = key > arr[middle] ? middle + 1 : min;
max = key > arr[middle] ? max : middle;
}
return min;
}
With gcc5.3 -O3, yields:
binary_cmov(int const*, int, int):
xorl %eax, %eax
testl %esi, %esi
jle .L4
.L3:
leal (%rax,%rsi), %ecx
sarl %ecx
movslq %ecx, %r8
leal 1(%rcx), %r9d
movl (%rdi,%r8,4), %r8d
cmpl %edx, %r8d
cmovl %r9d, %eax
cmovge %ecx, %esi
cmpl %eax, %esi
jg .L3
rep ret
.L4:
rep ret
moral of the story - don't embed assembler. All you do is make the code non-portable.
going further...
why not express intent even more explicitly?
#include <utility>
template<class...Ts>
auto sum(Ts...ts)
{
std::common_type_t<Ts...> result = 0;
using expand = int[];
void(expand{ 0, ((result += ts), 0)... });
return result;
}
template<class...Ts>
auto average(Ts...ts)
{
return sum(ts...) / sizeof...(Ts);
}
int binary_cmov (const int *arr, int n, int key)
{
int min = 0, max = n;
while (min < max)
{
int middle = average(min, max);
auto greater = key > arr[middle];
min = greater ? middle + 1 : min;
max = greater ? max : middle;
}
return min;
}
compiler output:
binary_cmov(int const*, int, int):
xorl %eax, %eax
testl %esi, %esi
jle .L4
.L3:
leal (%rax,%rsi), %ecx
movslq %ecx, %rcx
shrq %rcx
movslq %ecx, %r8
leal 1(%rcx), %r9d
movl (%rdi,%r8,4), %r8d
cmpl %edx, %r8d
cmovl %r9d, %eax
cmovge %ecx, %esi
cmpl %eax, %esi
jg .L3
rep ret
.L4:
rep ret
Upvotes: 4