How do I shift left in assembly with ADD, not using SHL?

Reputation: 365237

I know I need to use the binary values of these hex values to shift them to the left by adding 1 or something like that.

Hex is a human readable way to print binary values. Your bytes in registers and memory are just made of bits because computers use binary logic.

Since you can't use shift or rotate instructions (other than left-shift by 1 using add same,same), I'd consider using store / reload to a temporary get bytes in the desired order.

Notice that the desired result only reorders data with byte granularity, i.e. moving around whole bytes. No bits have to move between bytes. So you don't necessarily need bit-shifts.

Your idea isn't wrong but it requires a lot of instructions. I'm just presenting other ways.

The point of the exercise seems to be teaching you about how AX, AH, and AL alias onto parts of EAX. (And similar for ECX, EDX, and EBX). And/or how a dword load/store affects the 4 individual bytes of memory it touches, i.e. little-endian byte order.

I don't really see the point of forbidding shl eax,8 or 16, if the intended solution is to use add 16 times as a shift. There's still the issue of using partial registers to modify bytes of EAX. I mean, using the x+x = 2*x = x<<1 identity is often useful, e.g. doing math in addressing modes with LEA like x*3 = lea eax, [eax + eax*2]. Or using add same,same as a more efficient shl eax,1.

I successfully completed this part, I just provided this for context. The next part is where I am having issues. In part two we need to move these vars into register EAX. var1 must be stored in the highest byte of eax [...]

Assuming the vars are stored in the order shown (with var1 first, thus at the lowest address:

The order you specified is the opposite of what you'd get from a 4-byte load that covered all 4 bytes: x86 is little-endian so the lowest-address byte ends up in the low byte of EAX (aka AL).

Normally you'd do a dword load then bswap eax to reverse the order of the 4 bytes in the register. (Or on newer CPUs that support it, a movbe load: mov big-endian data is usually at least as efficient as separate load + bswap.) But the exercise forces you to get fancier.

Wider load/store that spans multiple byte vars

Assuming your 4 vars are declared contiguously, with var1 first (at the lowest address), you load multiple bytes at once with a word or dword mov load. e.g.

mov   eax,  dword ptr [var1]       ; AL=var1, AH=var2, higher bytes = var3, var4

MASM associates a size with data labels, so to keep it happy we have to apply the dword ptr size override to the memory operand to make it match the dword size of the register EAX. mov always requires both operands to be the same size, but for a case like mov [esp], eax the register source operand EAX implies the size for the memory destination. Without a label (or 2 registers) there is no possibility of mismatch. It's also optional when the size matches on its own, like for mov al, [var1]

As far as the CPU is concerned everything is just bytes; the dword ptr override is purely a source-level thing to keep the assembler happy. Other assemblers like NASM wouldn't require it. And it's only because the addressing mode includes a data label that MASM considers a "variable" with a size associated with it.

Given that you've already done part1 and stored that to memory, so you have 44h, 41h, 42h, 43h in memory in that order (i.e. 43'42'41'44h as a little-endian dword) you might go about part 2 something like this:

    mov  ax, word ptr [var1]     ; AX = 41'44h = AH:AL
    mov  dx, word ptr [var1 + 2] ; DX = 43'42h = DH:DL

    sub  esp, 4                  ; reserve stack space
    mov  [esp+0], DH
    mov  [esp+1], DL
    mov  [esp+2], AH             ; 4x byte stores
    mov  [esp+3], AL             ; constructing a byte-reversed copy
    
    mov   eax, [esp]             ; reload that
    add   esp, 4                 ; release the stack allocation

In my comments I separated the bytes with ' when writing the 32-bit hex representation of the EAX value to make it easier to see the byte boundaries vs. 44414243h

This is far fewer instructions that proposed by @Pablo's answer. It does have the performance downside of causing a store-forwarding stall. (The dword reload of EAX is data that was just written by 4 separate byte stores). But vs. 16 add instructions, it's probably still better latency as well as throughput! A store-forwarding stall is "only" about 15 cycles total store->reload latency, but may block other store-forwarding during that time, or at least multiple store-forwarding failures don't seem to be able to be in flight at once. See another paragraph below.

Doing 4x byte loads and two word stores would be an option, but that would require partial-register merging and probably be worse on Intel. See the part 1 implementation below that does that, near the end of this answer.

Efficiently doing the first part

If you could use any instructions you wanted, you'd just use a left-rotate to move the high 8 bits down to the start, and move the other bits left to make room.

                                     ;             MSB(var4)    LSB(var1)
                                     ; dword at var1 = 44'43'42'41h
    rol   dword ptr [var1], 8        ; dword at var1 = 43'42'41'44h

Or you could use a left shift by 1 byte (8 bits):

    mov  eax, dword ptr [var1]        ; EAX = 44'43'42'41h
    add  eax, eax                     ; shift left by 8 bits, 1 at a time
    add  eax, eax
    add  eax, eax
    add  eax, eax
    add  eax, eax
    add  eax, eax
    add  eax, eax
    add  eax, eax                     ; EAX = 43'42'41'00h  ; after shifting

    mov  al, [var4]                   ; EAX = 43'42'41'44h  ; replace the low byte
    mov  dword ptr [var1], eax        ; overwrite var1..4 with the bytes of EAX

Writing AL and then reading EAX can cause a partial-register stall on some old CPUs, but we don't need to write AH separately so modern CPUs won't have any performance problem here.

It would be really handy if we could just do a DWORD store that wrote outside the 4 bytes, i.e. stepping on a hypothetical var5. Then we could just do

;; writes 1 byte past var4
    mov  eax, dword ptr [var1]
    mov  dl,  [var4]

    mov dword ptr [var2], eax         ; overwrite var2..4 and 1 extra byte past the end
    mov  [var1], dl

We can use some temporary storage on the stack to make this work, then copy back to var1, but that's kind of clunky:

   sub  esp, 8              ; reserve some stack space
   mov  eax, dword ptr [var1]        ; EAX = 44'43'42'41h

   mov  [esp], eax
   mov  [esp+4], eax                 ; 2 copies of EAX so we can take any 4-byte window

   mov  eax, [esp+3]                 ; EAX = 43'42'41'44h = rotated
   mov  dword ptr [var1], eax        ; and store that over var1..4
   add  esp, 8              ; and dealloc it.

This emulates a rotate by concatenating 2 copies of the byte sequence in memory (in two dwords). Then we can load any 4-byte window. We choose the one where the low byte comes from the high byte of the original value, a 1-byte (8-bit) left rotate = 3-byte (24-bit) right rotate.

This will cause a store-forwarding stall, unfortunately. (Two word stores being read by a load that overlaps both of them). So about ~15 cycle latency instead of the usual ~5 on a modern x86. (https://agner.org/optimize).

It's maybe better for throughput than the version with 8x add but worse for latency. Or maybe worse for throughput as well, since independent chains of add can overlap their execution. Multiple store-forwarding stalls can't be happening at once on Intel CPUs at least, so that's a throughput limitation if you're doing a lot of this, not just this between a lot of other work.

With only byte loads:

Loading everything before doing any stores means we don't have to worry about destroying data we haven't read yet.

    mov  ah, [var1]       ; EAX = ??'??'41'??h
    mov  al, [var4]       ; EAX = ??'??'41'44h
    mov  ch, [var3]       ; ECX = ??'??'43'??h
    mov  cl, [var2]       ; ECX = ??'??'43'42h

    mov  word ptr [var3], cx  ;  var3 = CL = 42h   var4 = CH = 43h
    mov  word ptr [var1], ax  ;  var1 = AL = 44h   var2 = AH = 41h

The other obvious option is two word loads and 4 byte stores. For performance, that would avoid partial-register merging stalls (from writing AL/AH and then reading AX) on Intel CPUs, but load throughput is usually better than store throughput. Although the AH-merging penalty might be worse than that on Haswell/Skylake

But since we want to keep 2 of the source bytes in order, we can combine the CL and CH loads into one CX load:

Perhaps the most efficient version of part 1

    mov  ah, [var1]              ; EAX = ??'??'41'??h
    mov  al, [var4]              ; EAX = ??'??'41'44h
    mov  cx, word ptr [var2]     ; ECX = ??'??'43'42h

    mov  word ptr [var3], cx  ;  var3 = CL = 42h   var4 = CH = 43h
    mov  word ptr [var1], ax  ;  var1 = AL = 44h   var2 = AH = 41h

Notice that we're doing a word load from var2 then a word store to var3, shifting those 2 bytes left by 8 bits / 1 byte.

The only stall here is reading AX after writing AL+AH, and that's at most a cycle in the front end (during which the merging uop issues by itself) on Intel Sandybridge-family.

If later code is going to do a dword load of the 4 bytes we rearranged very soon after this, writing it in 2 word halves will lead to a store forwarding stall. But my bswap emulation above only reloads in two word halves so that's fine.

This also has a false dependency on the old value of EAX and ECX (which you could avoid for ECX by doing a dword load, but still a word store). Or for EAX, mov eax,0 or sub eax,eax will zero it and (on most CPUs) break dependencies on the old value, allowing out-of-order execution of this sequence of instructions even if the last block that used ECX for a different value is still stalled.

This is fewer instructions (and each instruction is still fairly efficient) than either the 8x add version or the version using scratch space on the stack. And doesn't store/reload to a temporary.

Combining part 1 and part 2:

Ending up with the desired data in EAX and byte-reversed in memory could be combined into a single step, perhaps reusing a temporary on the stack.

Of course if you weren't restricted in instruction choice you could do is pretty efficiently:

    mov   eax, dword ptr [var1]
    ror   eax, 8
    mov   dword ptr [var1], eax
    bswap eax

Optimizing part 1 and 2 together with only mov and add/sub is left as an exercise for the reader.

Other fun ideas: XOR swap also works with SUB, so you could inefficiently swap a byte between a register and memory. But you're not limited in using tmp regs so it's going to be better to load into a separate byte reg.

Upvotes: 0