Reputation: 411
I'm working on vectorizing loops, and GCC is giving me a hard time. When I look at the assembly code it generates, I see a lot of strange lines that I would like to get rid of.
For example, I've learnt that with vectorization you can avoid a lot of extra assembly lines by giving GCC additional information about array alignment: http://locklessinc.com/articles/vectorize/
Here is my experiment.
#include <stdint.h>
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
    int i = 0;
    comp[i]=a[i]|b[i];
}
Generates simple assembly:
.globl _ZN8Test_LUT7performEv
_ZN8Test_LUT7performEv:
.LFB664:
    .cfi_startproc
    movq 16(%rdi), %rax
    movq 8(%rdi), %rcx
    movq 32(%rdi), %rdx
    movzwl (%rax), %eax
    orw (%rcx), %ax
    movw %ax, (%rdx)
    ret
    .cfi_endproc
Although I was expecting a few extra lines, I was very surprised by what I got after adding a loop:
#include <stdint.h>
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
    int i = 0;
    for(i=0;i<SIZE;i++)
        comp[i]=a[i]|b[i];
}
Generates this assembly with a lot more lines:
_Z10itwillworkPKtS0_Pt:
.LFB663:
    .cfi_startproc
    leaq 16(%rdx), %rax
    leaq 16(%rsi), %rcx
    cmpq %rsi, %rax
    setbe %r8b
    cmpq %rcx, %rdx
    setnb %cl
    orb %cl, %r8b
    je .L55
    cmpq %rdi, %rax
    leaq 16(%rdi), %rax
    setbe %cl
    cmpq %rax, %rdx
    setnb %al
    orb %al, %cl
    je .L55
    xorl %eax, %eax
    .p2align 4,,10
    .p2align 3
.L57:
    movdqu (%rsi,%rax), %xmm1
    movdqu (%rdi,%rax), %xmm0
    por %xmm1, %xmm0
    movdqu %xmm0, (%rdx,%rax)
    addq $16, %rax
    cmpq $2048, %rax
    jne .L57
    rep ret
    .p2align 4,,10
    .p2align 3
.L55:
    xorl %eax, %eax
    .p2align 4,,10
    .p2align 3
.L58:
    movzwl (%rsi,%rax), %ecx
    orw (%rdi,%rax), %cx
    movw %cx, (%rdx,%rax)
    addq $2, %rax
    cmpq $2048, %rax
    jne .L58
    rep ret
    .cfi_endproc
Both were compiled with GCC 4.8.4 in release mode, with -O2 -ftree-vectorize -msse2.
Can somebody help me get rid of those lines? Or, if that is impossible, can you tell me why they are there?
Update:
I've tried the tricks from http://locklessinc.com/articles/vectorize/, but I ran into another issue:
#define SIZE 1024
void itwillwork (const uint16_t * a, const uint16_t * b, uint16_t * comp) {
int i = 0;
for(i=0;i<SIZE;i++)
comp[i]=a[i]|b[i];
}
A few assembly lines are generated for this function, which I understand. But when I call this function from somewhere else:
itwillwork(a,b,c);
there is no call instruction: the long list of instructions from "itwillwork" (the same as above) is emitted directly at the call site. Am I missing something? (The "extra lines" are the problem, not the inlined call.)
Upvotes: 0
Views: 241
Reputation: 3529
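You are getting "weird" code because GCC cannot assume that your pointers do not alias. The compare/set instructions at the top of the function are a runtime overlap test: GCC checks whether comp lies within 16 bytes of a or b to decide whether it can take the fast path and do 128 bits at a time, or must take the slow path and do 16 bits at a time. It also knows nothing about the alignment of your pointers, which is why the fast path uses unaligned movdqu loads and stores.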
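To make the two paths easier to see, here is a rough C++ rendering of what the generated assembly appears to do. It is only a sketch: the helper name disjoint16 and the function name itwillwork_sketch are made up for illustration, and the inner loop merely stands in for the single movdqu/por/movdqu sequence the compiler emits.

```cpp
#include <stdint.h>

#define SIZE 1024

// Hypothetical helper: true when the 16-byte windows starting at p and q
// do not overlap. This mirrors the leaq/cmpq/setbe/setnb sequence.
static bool disjoint16(const void *p, const void *q) {
    uintptr_t x = reinterpret_cast<uintptr_t>(p);
    uintptr_t y = reinterpret_cast<uintptr_t>(q);
    return x + 16 <= y || y + 16 <= x;
}

// Illustrative only: the shape of the code GCC generated for the loop.
void itwillwork_sketch(const uint16_t *a, const uint16_t *b, uint16_t *comp) {
    if (disjoint16(comp, a) && disjoint16(comp, b)) {
        // Fast path (.L57): 8 elements, i.e. 128 bits, per iteration.
        for (int i = 0; i < SIZE; i += 8)
            for (int j = 0; j < 8; j++)
                comp[i + j] = a[i + j] | b[i + j];
    } else {
        // Slow path (.L58): one 16-bit element per iteration.
        for (int i = 0; i < SIZE; i++)
            comp[i] = a[i] | b[i];
    }
}
```

Both branches compute the same result when the arrays are distinct; the branch exists only so that the vectorized loop stays correct if comp overlaps one of the inputs.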
Additionally, the reason you are finding the code repeated is that the compiler is applying an inlining optimisation. You could disable this with the __attribute__((noinline)) attribute, but if performance is your goal, let the compiler inline it.
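If you only want to inspect the function's assembly in isolation, a minimal sketch of the noinline variant could look like this (the attribute is a GCC/Clang extension; the function body is taken from the question):

```cpp
#include <stdint.h>

#define SIZE 1024

// GCC/Clang extension: ask the compiler not to inline this function,
// so call sites contain a real call instruction.
__attribute__((noinline))
void itwillwork(const uint16_t *a, const uint16_t *b, uint16_t *comp) {
    for (int i = 0; i < SIZE; i++)
        comp[i] = a[i] | b[i];
}
```

This only changes where the instructions end up; the overlap test and the two loops are still generated inside the function itself.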
If you specify the __restrict keyword, the compiler will only generate the fast-path code: https://goo.gl/g3jUfQ
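As a sketch, the function from the question with __restrict applied to every pointer would look like this (__restrict is the GCC spelling that also works in C++; I have not re-run it on GCC 4.8.4 specifically, so treat the exact output as something to verify):

```cpp
#include <stdint.h>

#define SIZE 1024

// __restrict promises that comp is not accessed through a or b,
// so the compiler no longer needs a runtime overlap test.
void itwillwork(const uint16_t *__restrict a,
                const uint16_t *__restrict b,
                uint16_t *__restrict comp) {
    for (int i = 0; i < SIZE; i++)
        comp[i] = a[i] | b[i];
}
```

Calling it is unchanged, for example itwillwork(a, b, c) with three distinct arrays of SIZE elements.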
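However, __restrict is only a promise from you to the compiler: nothing checks that the arrays really do not overlap, and it does not mean the compiler is going to magically take care of alignment for you, so take care of what you pass to the function.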
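If you can guarantee 16-byte alignment yourself, one way to tell GCC about it is __builtin_assume_aligned, the builtin that alignment hints in GCC are usually expressed with. The sketch below assumes every caller really passes 16-byte aligned arrays; the name itwillwork_aligned is made up for illustration, and whether the compiler then switches to aligned loads depends on the compiler version and flags.

```cpp
#include <stdint.h>

#define SIZE 1024

// Assumption: callers pass 16-byte aligned, non-overlapping arrays.
// Breaking either promise is undefined behaviour.
void itwillwork_aligned(const uint16_t *__restrict a,
                        const uint16_t *__restrict b,
                        uint16_t *__restrict comp) {
    const uint16_t *pa =
        static_cast<const uint16_t *>(__builtin_assume_aligned(a, 16));
    const uint16_t *pb =
        static_cast<const uint16_t *>(__builtin_assume_aligned(b, 16));
    uint16_t *pc =
        static_cast<uint16_t *>(__builtin_assume_aligned(comp, 16));
    for (int i = 0; i < SIZE; i++)
        pc[i] = pa[i] | pb[i];
}
```

On the caller's side the arrays then have to be declared with matching alignment, for example alignas(16) uint16_t a[SIZE]; in C++11, or with __attribute__((aligned(16))) on older compilers.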
Upvotes: 1