Curious Sam
Curious Sam

Reputation: 1289

Understanding C x86 optimization of a small function

I have a relatively small C function.

#define ROTL16(x, r) (((x) << (r)) | (x >> (16 - (r))))

void a_inverse(uint16_t *left, uint16_t *right) {
  *right ^= *left;
  *right = ROTL16(*right, 14);
  *left -= *right;
  *left = ROTL16(*left, 7);
}

The function was called by submitting uint16_t pointers that point to low and hight 16-bit of a single uint32_t integer.

uint32_t whole;
uint16_t * left = (uint16_t *)&whole;
uint16_t * right = left + 1;

a_inverse(left, right);

But the function does not perform as expected. So I blindly add the volatile keyword.

#define ROTL16(x, r) (((x) << (r)) | (x >> (16 - (r))))

void a_inverse(volatile uint16_t *left, volatile uint16_t *right) {
  *right ^= *left;
  *right = ROTL16(*right, 14);
  *left -= *right;
  *left = ROTL16(*left, 7);
}

This time it works correctly.

Comparing the output assembly with -O3 I see nothing wrong

This is a non-volatile version

a_inverse:
  .cfi_startproc
  endbr64
  movzwl    (%rsi), %ea
  xorw  (%rdi), %ax
  rorw  $2, %ax
  movw  %ax, (%rsi)
  movzwl    (%rdi), %edx
  subl  %eax, %edx
  movl  %edx, %eax
  rolw  $7, %ax
  movw  %ax, (%rdi)
  ret
  .cfi_endproc

And this is the one with volatile keyword

a_inverse:
  .cfi_startproc
  endbr64
  movzwl    (%rdi), %edx
  movzwl    (%rsi), %eax
  xorl  %edx, %eax
  movw  %ax, (%rsi)
  movzwl    (%rsi), %eax
  movzwl    (%rsi), %edx
  sall  $14, %eax
  shrw  $2, %dx
  orl   %edx, %eax
  movw  %ax, (%rsi)
  movzwl    (%rsi), %edx
  movzwl    (%rdi), %eax
  subl  %edx, %eax
  movw  %ax, (%rdi)
  movzwl    (%rdi), %eax
  movzwl    (%rdi), %edx
  sall  $7, %eax
  shrw  $9, %dx
  orl   %edx, %eax
  movw  %ax, (%rdi)
  ret
  .cfi_endproc

And I don't understand why do the functions behave differently.

Could you help me understand what the reason might be?
The only difference in expected/unexpected results is those volatile usages. Removing either one of the volatile would result in an unexpected result.

Upvotes: 0

Views: 121

Answers (1)

John Bollinger
John Bollinger

Reputation: 180715

The main problem is unlikely to be in function a_inverse() itself, and the fact that volatile-qualifying the arguments changed the behavior is a happenstance if the data are not actually volatile.

In all likelihood, the locus of the problem is actually in the caller. Indeed, in comments, you describe computing the argument pointers something like this:

uint32_t *whole = ...;
uint16_t *left = (uint16_t *) whole;
uint16_t *right = left + 1;

which would presumably be followed up by a call such as

a_inverse(left, right);

. The cast to type uint16_t * is permitted. The permissibility of the pointer arithmetic is debatable, but in practice it is unlikely to be an issue. But because the resulting left and right pointers don't actually point to uint16_t objects, attempting to use them to access the object to which they point violates the strict aliasing rule (C18, paragraph 6.5/7), and therefore produces undefined behavior.

The strict aliasing rule says that an object can be accessed only via an lvalue of compatible type +/- qualification and signedness differences, possibly in a union, or via an lvalue of character type. (Among the key implications of the latter is that you may access any object's representation via a char *.) This allows compilers to use data-type analysis to inform optimization choices.

GCC has a flag, -fno-strict-aliasing, that tells it not to use type analysis to make optimization decisions -- basically, forcing it to assume that any pointer might alias any other. If you don't want to fix the issue then using that flag would likely mitigate it.

On the other hand, I suspect that all the pointer (mis)use is a misguided attempt at optimization. I would rewrite the function without:

uint32_t a_inverse(uint32_t whole) {
    uint16_t left = whole;  // x86 is little-endian
    uint16_t right = whole >> 16;

    right ^= left;
    right  = ROTL16(right, 14);
    left  -= right;
    left   = ROTL16(left, 7);

    return right << 16 + left;
}

... and expect it to be at least as fast, not least because there are no aliasing risks to limit available optimizations.

Upvotes: 4

Related Questions