Reputation: 432
I am learning assembly. So I wrote a routine that returns the square root of its input if the input is non-negative, and it returns 0 otherwise.
I have implemented the routine in both assembly and C, and I would like to understand why my C routines compiled with -O2 are much faster than my assembly routine. The disassembled code for the C routines looks slightly more complex than my assembly routine, so I don't understand where I am going wrong.
The assembly routine (srt.asm):
global srt
section .text
srt:
pxor xmm1,xmm1
comisd xmm0,xmm1
jbe P
sqrtsd xmm0,xmm0
retq
P:
pxor xmm0,xmm0
retq
I am compiling the above as
nasm -g -felf64 srt.asm
The C routines (srtc.c):
#include <stdio.h>
#include <math.h>
#include <time.h>
extern double srt(double);
double srt1(double x)
{
    return sqrt( (x > 0) * x );
}
double srt2(double x)
{
    if( x > 0) return sqrt(x);
    return 0;
}
int main(void)
{
    double v = 0;
    clock_t start;
    clock_t end;
    double niter = 2e8;
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt(i);
    }
    end = clock();
    printf("time taken srt = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt1(i);
    }
    end = clock();
    printf("time taken srt1 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);
    start = clock();
    v = 0;
    for( double i = 0; i < niter; i++ ) {
        v += srt2(i);
    }
    end = clock();
    printf("time taken srt2 = %f v=%g\n", (double) (end - start)/CLOCKS_PER_SEC,v);
    return 0;
}
The above is compiled as
gcc -g -O2 srt.o -o srtc srtc.c -lm
The output of the program is
time taken srt = 0.484375 v=1.88562e+12
time taken srt1 = 0.312500 v=1.88562e+12
time taken srt2 = 0.312500 v=1.88562e+12
So my assembly routine is significantly slower: about 0.48 s versus 0.31 s for either C routine.
The disassembled C code is
Disassembly of section .text:
0000000000000000 <srt1>:
0: f3 0f 1e fa endbr64
4: 66 0f ef c9 pxor xmm1,xmm1
8: 66 0f 2f c1 comisd xmm0,xmm1
c: 77 04 ja 12 <srt1+0x12>
e: f2 0f 59 c1 mulsd xmm0,xmm1
12: 66 0f 2e c8 ucomisd xmm1,xmm0
16: 66 0f 28 d0 movapd xmm2,xmm0
1a: f2 0f 51 d2 sqrtsd xmm2,xmm2
1e: 77 05 ja 25 <srt1+0x25>
20: 66 0f 28 c2 movapd xmm0,xmm2
24: c3 ret
25: 48 83 ec 18 sub rsp,0x18
29: f2 0f 11 54 24 08 movsd QWORD PTR [rsp+0x8],xmm2
2f: e8 00 00 00 00 call 34 <srt1+0x34>
34: f2 0f 10 54 24 08 movsd xmm2,QWORD PTR [rsp+0x8]
3a: 48 83 c4 18 add rsp,0x18
3e: 66 0f 28 c2 movapd xmm0,xmm2
42: c3 ret
43: 66 66 2e 0f 1f 84 00 data16 nop WORD PTR cs:[rax+rax*1+0x0]
4a: 00 00 00 00
4e: 66 90 xchg ax,ax
0000000000000050 <srt2>:
50: f3 0f 1e fa endbr64
54: 66 0f ef c9 pxor xmm1,xmm1
58: 66 0f 2f c1 comisd xmm0,xmm1
5c: 66 0f 28 d1 movapd xmm2,xmm1
60: 77 0e ja 70 <srt2+0x20>
62: 66 0f 28 c2 movapd xmm0,xmm2
66: c3 ret
67: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
6e: 00 00
70: 66 0f 2e c8 ucomisd xmm1,xmm0
74: 66 0f 28 d0 movapd xmm2,xmm0
78: f2 0f 51 d2 sqrtsd xmm2,xmm2
7c: 76 e4 jbe 62 <srt2+0x12>
7e: 48 83 ec 18 sub rsp,0x18
82: f2 0f 11 54 24 08 movsd QWORD PTR [rsp+0x8],xmm2
88: e8 00 00 00 00 call 8d <srt2+0x3d>
8d: f2 0f 10 54 24 08 movsd xmm2,QWORD PTR [rsp+0x8]
93: 48 83 c4 18 add rsp,0x18
97: 66 0f 28 c2 movapd xmm0,xmm2
9b: c3 ret
Upvotes: 2
Views: 361
Reputation: 432
Peter Cordes's comment explains what is happening here: srt1 and srt2 are inlined into main, while srt is not. Quoting Peter Cordes:
Oh right, simply being a non-inline function is the problem. x86-64 System V doesn't have any call-preserved XMM registers, so the add dependency chain through v includes a store/reload for srt(), but not when srt1 or srt2 inline.
Upvotes: 1