Reputation: 681
Our teacher in class gave us an assignment to do. The issue is that we did not cover bit shifting in class, so I am a little bit lost on how to do this. In the assignment instructions he tells us the following:
We start out with 4 variables:
The first part of the assignment is to move these around so we end up with the following order:
I successfully completed this part, I just provided this for context. The next part is where I am having issues. In part two we need to move these vars into register eax. var1 must be stored in the highest byte of eax, var2 in the second highest, war3 in the second lowest byte and var4 in the lowest.
The end result should give eax = 444144243
So here is what I do know about this problem:
How do I go about shifting var1 so that it ends up in the upper 16 bits of eax and so forth with the other vars?
NOTE: I have no code as I have no idea where to even start with this part of the assignment. I do not want it solved, I just want help on finding the right path to take to solve it.
Upvotes: 4
Views: 4622
Reputation: 365237
I know I need to use the binary values of these hex values to shift them to the left by adding 1 or something like that.
Hex is a human readable way to print binary values. Your bytes in registers and memory are just made of bits because computers use binary logic.
Since you can't use shift or rotate instructions (other than left-shift by 1 using add same,same
), I'd consider using store / reload to a temporary get bytes in the desired order.
Notice that the desired result only reorders data with byte granularity, i.e. moving around whole bytes. No bits have to move between bytes. So you don't necessarily need bit-shifts.
Your idea isn't wrong but it requires a lot of instructions. I'm just presenting other ways.
The point of the exercise seems to be teaching you about how AX, AH, and AL alias onto parts of EAX. (And similar for ECX, EDX, and EBX). And/or how a dword load/store affects the 4 individual bytes of memory it touches, i.e. little-endian byte order.
I don't really see the point of forbidding shl eax,8
or 16, if the intended solution is to use add
16 times as a shift. There's still the issue of using partial registers to modify bytes of EAX. I mean, using the x+x = 2*x = x<<1
identity is often useful, e.g. doing math in addressing modes with LEA like x*3
= lea eax, [eax + eax*2]
. Or using add same,same
as a more efficient shl eax,1
.
I successfully completed this part, I just provided this for context. The next part is where I am having issues. In part two we need to move these vars into register EAX.
var1
must be stored in the highest byte of eax [...]
Assuming the vars are stored in the order shown (with var1
first, thus at the lowest address:
The order you specified is the opposite of what you'd get from a 4-byte load that covered all 4 bytes: x86 is little-endian so the lowest-address byte ends up in the low byte of EAX (aka AL).
Normally you'd do a dword load then bswap eax
to reverse the order of the 4 bytes in the register. (Or on newer CPUs that support it, a movbe
load: mov big-endian data is usually at least as efficient as separate load + bswap.) But the exercise forces you to get fancier.
Assuming your 4 vars are declared contiguously, with var1
first (at the lowest address), you load multiple bytes at once with a word or dword mov
load. e.g.
mov eax, dword ptr [var1] ; AL=var1, AH=var2, higher bytes = var3, var4
MASM associates a size with data labels, so to keep it happy we have to apply the dword ptr
size override to the memory operand to make it match the dword size of the register EAX. mov
always requires both operands to be the same size, but for a case like mov [esp], eax
the register source operand EAX implies the size for the memory destination. Without a label (or 2 registers) there is no possibility of mismatch. It's also optional when the size matches on its own, like for mov al, [var1]
As far as the CPU is concerned everything is just bytes; the dword ptr
override is purely a source-level thing to keep the assembler happy. Other assemblers like NASM wouldn't require it. And it's only because the addressing mode includes a data label that MASM considers a "variable" with a size associated with it.
Given that you've already done part1 and stored that to memory, so you have 44h, 41h, 42h, 43h
in memory in that order (i.e. 43'42'41'44h
as a little-endian dword) you might go about part 2 something like this:
mov ax, word ptr [var1] ; AX = 41'44h = AH:AL
mov dx, word ptr [var1 + 2] ; DX = 43'42h = DH:DL
sub esp, 4 ; reserve stack space
mov [esp+0], DH
mov [esp+1], DL
mov [esp+2], AH ; 4x byte stores
mov [esp+3], AL ; constructing a byte-reversed copy
mov eax, [esp] ; reload that
add esp, 4 ; release the stack allocation
In my comments I separated the bytes with '
when writing the 32-bit hex representation of the EAX value to make it easier to see the byte boundaries vs. 44414243h
This is far fewer instructions that proposed by @Pablo's answer. It does have the performance downside of causing a store-forwarding stall. (The dword reload of EAX is data that was just written by 4 separate byte stores). But vs. 16 add
instructions, it's probably still better latency as well as throughput! A store-forwarding stall is "only" about 15 cycles total store->reload latency, but may block other store-forwarding during that time, or at least multiple store-forwarding failures don't seem to be able to be in flight at once. See another paragraph below.
Doing 4x byte loads and two word
stores would be an option, but that would require partial-register merging and probably be worse on Intel. See the part 1 implementation below that does that, near the end of this answer.
If you could use any instructions you wanted, you'd just use a left-rotate to move the high 8 bits down to the start, and move the other bits left to make room.
; MSB(var4) LSB(var1)
; dword at var1 = 44'43'42'41h
rol dword ptr [var1], 8 ; dword at var1 = 43'42'41'44h
Or you could use a left shift by 1 byte (8 bits):
mov eax, dword ptr [var1] ; EAX = 44'43'42'41h
add eax, eax ; shift left by 8 bits, 1 at a time
add eax, eax
add eax, eax
add eax, eax
add eax, eax
add eax, eax
add eax, eax
add eax, eax ; EAX = 43'42'41'00h ; after shifting
mov al, [var4] ; EAX = 43'42'41'44h ; replace the low byte
mov dword ptr [var1], eax ; overwrite var1..4 with the bytes of EAX
Writing AL and then reading EAX can cause a partial-register stall on some old CPUs, but we don't need to write AH separately so modern CPUs won't have any performance problem here.
It would be really handy if we could just do a DWORD store that wrote outside the 4 bytes, i.e. stepping on a hypothetical var5
. Then we could just do
;; writes 1 byte past var4
mov eax, dword ptr [var1]
mov dl, [var4]
mov dword ptr [var2], eax ; overwrite var2..4 and 1 extra byte past the end
mov [var1], dl
We can use some temporary storage on the stack to make this work, then copy back to var1
, but that's kind of clunky:
sub esp, 8 ; reserve some stack space
mov eax, dword ptr [var1] ; EAX = 44'43'42'41h
mov [esp], eax
mov [esp+4], eax ; 2 copies of EAX so we can take any 4-byte window
mov eax, [esp+3] ; EAX = 43'42'41'44h = rotated
mov dword ptr [var1], eax ; and store that over var1..4
add esp, 8 ; and dealloc it.
This emulates a rotate by concatenating 2 copies of the byte sequence in memory (in two dwords). Then we can load any 4-byte window. We choose the one where the low byte comes from the high byte of the original value, a 1-byte (8-bit) left rotate = 3-byte (24-bit) right rotate.
This will cause a store-forwarding stall, unfortunately. (Two word stores being read by a load that overlaps both of them). So about ~15 cycle latency instead of the usual ~5 on a modern x86. (https://agner.org/optimize).
It's maybe better for throughput than the version with 8x add
but worse for latency. Or maybe worse for throughput as well, since independent chains of add
can overlap their execution. Multiple store-forwarding stalls can't be happening at once on Intel CPUs at least, so that's a throughput limitation if you're doing a lot of this, not just this between a lot of other work.
Loading everything before doing any stores means we don't have to worry about destroying data we haven't read yet.
mov ah, [var1] ; EAX = ??'??'41'??h
mov al, [var4] ; EAX = ??'??'41'44h
mov ch, [var3] ; ECX = ??'??'43'??h
mov cl, [var2] ; ECX = ??'??'43'42h
mov word ptr [var3], cx ; var3 = CL = 42h var4 = CH = 43h
mov word ptr [var1], ax ; var1 = AL = 44h var2 = AH = 41h
The other obvious option is two word
loads and 4 byte
stores. For performance, that would avoid partial-register merging stalls (from writing AL/AH and then reading AX) on Intel CPUs, but load throughput is usually better than store throughput. Although the AH-merging penalty might be worse than that on Haswell/Skylake
But since we want to keep 2 of the source bytes in order, we can combine the CL and CH loads into one CX load:
mov ah, [var1] ; EAX = ??'??'41'??h
mov al, [var4] ; EAX = ??'??'41'44h
mov cx, word ptr [var2] ; ECX = ??'??'43'42h
mov word ptr [var3], cx ; var3 = CL = 42h var4 = CH = 43h
mov word ptr [var1], ax ; var1 = AL = 44h var2 = AH = 41h
Notice that we're doing a word load from var2
then a word store to var3
, shifting those 2 bytes left by 8 bits / 1 byte.
The only stall here is reading AX after writing AL+AH, and that's at most a cycle in the front end (during which the merging uop issues by itself) on Intel Sandybridge-family.
If later code is going to do a dword load of the 4 bytes we rearranged very soon after this, writing it in 2 word halves will lead to a store forwarding stall. But my bswap emulation above only reloads in two word
halves so that's fine.
This also has a false dependency on the old value of EAX and ECX (which you could avoid for ECX by doing a dword load, but still a word store). Or for EAX, mov eax,0
or sub eax,eax
will zero it and (on most CPUs) break dependencies on the old value, allowing out-of-order execution of this sequence of instructions even if the last block that used ECX for a different value is still stalled.
This is fewer instructions (and each instruction is still fairly efficient) than either the 8x add
version or the version using scratch space on the stack. And doesn't store/reload to a temporary.
Ending up with the desired data in EAX and byte-reversed in memory could be combined into a single step, perhaps reusing a temporary on the stack.
Of course if you weren't restricted in instruction choice you could do is pretty efficiently:
mov eax, dword ptr [var1]
ror eax, 8
mov dword ptr [var1], eax
bswap eax
Optimizing part 1 and 2 together with only mov
and add/sub is left as an exercise for the reader.
Other fun ideas: XOR swap also works with SUB, so you could inefficiently swap a byte between a register and memory. But you're not limited in using tmp regs so it's going to be better to load into a separate byte reg.
Upvotes: 0
Reputation: 369
To shift a value to the left you need to ADD the same value to itself
For example, if you have 0011
0011 + 0011 = 0110 (shift 1 left)
0110 + 0110 = 1100 (shift 1 left again)
To solve your problem I would to the following (quick way)
MOV ah, var1 (move 44h to 0000h -> 4400h)
MOV al, var2 (move 41h to 4400h -> 4441h)
ADD eax, eax (4441h + 4441h = 8882h)
ADD eax, eax (8882h + 8882h = 11104h)
ADD eax, eax (11104h + 11104h = 22208h)
ADD eax, eax (22208h + 22208h = 44410h)
ADD eax, eax
ADD eax, eax
ADD eax, eax
ADD eax, eax (444100h)
ADD eax, eax
ADD eax, eax
ADD eax, eax
ADD eax, eax (4441000h)
ADD eax, eax
ADD eax, eax
ADD eax, eax
ADD eax, eax (44410000h)
Now to the other part
MOV ah, var3 (move 42h to 44410000h -> 44414200h)
MOV al, var4 (move 43h to 44414200h -> 44414243h)
Now your eax register is the following: 4441 4243h
Upvotes: 3