Utundu
Utundu

Reputation: 411

GCC generates undesired assembly code

I'm working on vectorizing loops, and GCC is giving me a hard time. When I look at the assembly code it generates, I see a lot of strange lines that I would like to get rid of.

For example, with vectorization, I've learnt that you can avoid a lot of extra assembly lines by giving additionnal information to GCC about array alignment. http://locklessinc.com/articles/vectorize/

Here is my experiment.

#define SIZE 1024
void itwillwork (const uint16_t * a, const  uint16_t * b, uint16_t * comp) {
    int i = 0;
    comp[i]=a[i]|b[i];
}

Generates simple assembly:

.globl  _ZN8Test_LUT7performEv
  23                _ZN8Test_LUT7performEv:
  24                .LFB664:
  25                    .cfi_startproc
  26 0020 488B4710      movq    16(%rdi), %rax
  27 0024 488B4F08      movq    8(%rdi), %rcx
  28 0028 488B5720      movq    32(%rdi), %rdx
  29 002c 0FB700        movzwl  (%rax), %eax
  30 002f 660B01        orw (%rcx), %ax
  31 0032 668902        movw    %ax, (%rdx)
  32 0035 C3            ret
  33                    .cfi_endproc

But, even if I was expecting a few extra lines, I am very surprised by what I got after adding a loop :

#define SIZE 1024
void itwillwork (const uint16_t * a, const  uint16_t * b, uint16_t * comp) {
    int i = 0;
    for(i=0;i<SIZE;i++)
        comp[i]=a[i]|b[i];
}

Generates this assembly with a lot more lines:

 233                _Z10itwillworkPKtS0_Pt:
 234                .LFB663:
 235                    .cfi_startproc
 236 0250 488D4210      leaq    16(%rdx), %rax
 237 0254 488D4E10      leaq    16(%rsi), %rcx
 238 0258 4839F0        cmpq    %rsi, %rax
 239 025b 410F96C0      setbe   %r8b
 240 025f 4839CA        cmpq    %rcx, %rdx
 241 0262 0F93C1        setnb   %cl
 242 0265 4108C8        orb %cl, %r8b
 243 0268 743E          je  .L55
 244 026a 4839F8        cmpq    %rdi, %rax
 245 026d 488D4710      leaq    16(%rdi), %rax
 246 0271 0F96C1        setbe   %cl
 247 0274 4839C2        cmpq    %rax, %rdx
 248 0277 0F93C0        setnb   %al
 249 027a 08C1          orb %al, %cl
 250 027c 742A          je  .L55
 251 027e 31C0          xorl    %eax, %eax
 252                    .p2align 4,,10
 253                    .p2align 3
 254                .L57:
 255 0280 F30F6F0C      movdqu  (%rsi,%rax), %xmm1
 255      06
 256 0285 F30F6F04      movdqu  (%rdi,%rax), %xmm0
 256      07
 257 028a 660FEBC1      por %xmm1, %xmm0
 258 028e F30F7F04      movdqu  %xmm0, (%rdx,%rax)
 258      02
 259 0293 4883C010      addq    $16, %rax
 260 0297 483D0008      cmpq    $2048, %rax
 260      0000
 261 029d 75E1          jne .L57
 262 029f F3C3          rep ret
 263                    .p2align 4,,10
 264 02a1 0F1F8000      .p2align 3
 264      000000
 265                .L55:
 266 02a8 31C0          xorl    %eax, %eax
 267 02aa 660F1F44      .p2align 4,,10
 267      0000
 268                    .p2align 3
 269                .L58:
 270 02b0 0FB70C06      movzwl  (%rsi,%rax), %ecx
 271 02b4 660B0C07      orw (%rdi,%rax), %cx
 272 02b8 66890C02      movw    %cx, (%rdx,%rax)
 273 02bc 4883C002      addq    $2, %rax
 274 02c0 483D0008      cmpq    $2048, %rax
 274      0000
 275 02c6 75E8          jne .L58
 276 02c8 F3C3          rep ret
 277                    .cfi_endproc

Both were compiled with gcc 4.8.4 in release mode, -O2 -ftree-vectorize -msse2.

Can somebody help me get rid of those lines? Or, if it's impossible, can you tell me why they are there ?

Update :

I've tried the tricks there http://locklessinc.com/articles/vectorize/, but I get another issue:

#define SIZE 1024
void itwillwork (const uint16_t * a, const  uint16_t * b, uint16_t * comp) {
    int i = 0;
    for(i=0;i<SIZE;i++)
        comp[i]=a[i]|b[i];
}

A few assembly lines are generated for this function, I get it. But when I call this function from somewhere else :

itwillwork(a,b,c);

There is no call instruction : the long list of instructions of "itwillwork" (the same as above) are used directly. Am I missing something ? (the "extra lines" are the problem, not the inline call)

Upvotes: 0

Views: 241

Answers (1)

Olipro
Olipro

Reputation: 3529

You are getting "weird" code because GCC cannot make assumptions about the alignment of your pointers so you can see that it is first performing an alignment test to determine whether it can take the fast path and do 128 bits at a time, or the slow path and do 16 bits at a time.

Additionally, the reason you are finding the code repeated is because the compiler is applying an inlining optimisation. You could disable this with the __attribute((noinline)) spec but if performance is your goal, let the compiler inline it.

If you specify the __restrict keyword then the compiler will only generate the fast-path code: https://goo.gl/g3jUfQ

However, this does not mean the compiler is going to magically take care of alignment for you so take care of what you pass to the function.

Upvotes: 1

Related Questions