Andreea
Andreea

Reputation: 67

Sum even elements of an array in MSVC inline asm

I'm trying to write a function in Assembly that will return the sum of the even elements in an array. This is my code

       int sum(int *a, int n)
       {
          int S = 0;
          _asm {

          mov eax, [ebp + 8]
          mov edx, 0
          mov ebx, [ebp + 12]
          mov ecx, 0
          for1: cmp ecx, ebx
          jge endfor
          and [eax + ecx * 4], 1
          jz even
          inc ecx
          jmp for1

          even: add edx, [eax + ecx * 4]            
          inc ecx
          jmp for1
          endfor: mov S, edx
          }
      return S;
      }

But it's not working. Does anybody know what it's wrong and how can I fix it? Thanks.

Upvotes: 1

Views: 1212

Answers (2)

Peter Cordes
Peter Cordes

Reputation: 364842

Why even use inline asm at all if you want to grab the args off the stack yourself? Either let the compiler give you the args (so it still works after inlining), or write pure asm in a separate file.

Here's a version of your function written nicely. Note the indenting, and the more efficient code (branch structure, and loading into a register instead of comparing in memory and then using as a memory operand for add. If the condition was expected to be very unlikely, a test or cmp with a memory operand would make sense, though.)

int sum(const int *a, int n)
{
  int S;

  _asm {
      mov    ecx, n
      xor    eax,eax              // Sum = 0
      test   ecx,ecx
      jz   .early_out             // n==0 case

      mov    esi, a
      lea    edi, [esi + 4*ecx]   // end pointer = &a[n]
      xor    ecx,ecx              // zeroed register for CMOV
  sum_loop:                   // do{
      // Assume even/odd is unpredictable, so do it branchlessly:
      mov    edx, [esi]
      test   edx, 1               // ZF cleared for odd numbers only
      cmovnz edx, ecx             // zero out EDX for odd numbers only
      add    eax, edx             // add zero or a[i]

      add    esi, 4
      cmp    esi, edi         // while(++pointer < end_pointer)
      jb   sum_loop

  early_out:
      mov S, eax     // MSVC does actually support leaving a value in EAX and falling off the end of a non-void function (without return),
                     // but clang -fasm-blocks doesn't.  And there's no way to explicitly tell the compiler where the output(s) are in this dumb syntax, unlike GNU C inline asm
  }

  return S;
}

With a branch, the simple way is

.sum_loop:
    mov   edx, [esi]
    test  edx, 1
    jnz .odd                    // conditionally skip the add
    add   eax, edx
.odd:
    add    esi, 4
    cmp    esi, edi             // pointer < end pointer
    jb  .sum_loop

Your original branch structure (duplicating the tail of the loop for the odd/even branches) is a useful technique for some cases, but not here where they're identical again after the add to the total.

In asm, write loops with the conditional branch at the bottom, do{}while() style. Check for conditions that imply zero iterations before entering the loop. You don't want a not-taken cmp/jcc and a taken unconditional branch.

lodsd is just as efficient as mov eax, [esi] / add esi, 4 on Intel starting with Haswell. It's 3 uops on earlier Intel, and on AMD, so it saves code size at the expense of speed. In this case that would mean we couldn't have the sum in EAX, although that isn't important because the safe way to get data out of an asm block is to mov-store it, not leave it in the return-value register. MSVC does seem to actually support the semantics of EAX being a return-value register even after inlining a function containing an asm{} block, but I don't know if that's documented. And clang -fasm-blocks certainly doesn't.

test dl, 1 would be slightly more efficient in the current code, fewer instruction bytes (3 instead of 6): there is no test r/m32, sign_extend_imm8, only with a 32-bit immediate. So if you want to test a bit in the low byte, use the low-8-bit partial register. For simplicity I tested the same register we loaded into.

Loading into EAX would actually gain efficiency, allowing test al,1 as a 2-byte instruction instead of any other byte register being 3 bytes. Smaller code-size is typically better, all else equal, unless it happens to cause an unlucky alignment for some later code, e.g. the JCC erratum on Skylake if a macro-fused cmp/jcc touches the end of a 32-byte block.

See the tag wiki for guides and stuff, also https://stackoverflow.com/tags/inline-assembly/info

Upvotes: 0

davmac
davmac

Reputation: 20641

Just some guesswork (especially as I don't know what compiler you are using, and you weren't clear about what you meant by "not working"):

      mov eax, [ebp + 8]
      mov edx, 0
      mov ebx, [ebp + 12]
      mov ecx, 0

I assume this is supposed to load the two parameters a and n into registers. How do you know the offsets are correct? Is it possible to simply refer to the names directly?

      and [eax + ecx * 4], 1

This destroys the element in the input array (setting it to 1 if odd or 0 if even), which you probably don't want. You should probably use the test instruction (which is non-destructive) instead of and.

      even: add edx, [eax + ecx * 4]

This will add 0, since you have set [eax + ecx * 4] to 0 via the and instruction which I mention above. Based on this, I would expect your function to always return 0.

Upvotes: 4

Related Questions