Reputation: 41
I'm trying to perform a right shift on Y86-64
To do a left shift, I know I need to multiply by 2^n, where n is the number of bits to shift (for example, a shift by 4 is a multiply by 2^4 = 16), and use a loop of additions to perform the multiplication. I'm not sure what to do for right shifts; I think I need to perform division, but I'm not sure how to approach that.
pcount_do:
movq $0, %rax
.L2: movq %rdi, %rdx
shrq %rdi
ret
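For reference, here is a minimal C sketch of the left-shift-by-repeated-addition idea described above (it doubles the value n times instead of building 2^n and then multiplying; the lshift name is made up for illustration):

#include <stdint.h>

/* left shift by n: double the value n times, using only addition */
uint64_t lshift(uint64_t x, int n) {
    for (int i = 0; i < n; ++i)
        x = x + x;      /* same as x *= 2 */
    return x;
}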
Upvotes: 4
Views: 2269
Reputation: 126867
Given that the Y86 instruction set lacks shifts and divisions, I'd go for something equivalent to this C code:
#include <stdint.h>

/* lut[i] = 2^i for i = 0..63 */
uint64_t lut[] = {
    1, 2, 4, 8,
    16, 32, 64, 128,
    256, 512, 1024, 2048,
    4096, 8192, 16384, 32768,
    65536, 131072, 262144, 524288,
    1048576, 2097152, 4194304, 8388608,
    16777216, 33554432, 67108864, 134217728,
    268435456, 536870912, 1073741824, 2147483648,
    4294967296, 8589934592, 17179869184, 34359738368,
    68719476736, 137438953472, 274877906944, 549755813888,
    1099511627776, 2199023255552, 4398046511104, 8796093022208,
    17592186044416, 35184372088832, 70368744177664, 140737488355328,
    281474976710656, 562949953421312, 1125899906842624, 2251799813685248,
    4503599627370496, 9007199254740992, 18014398509481984, 36028797018963968,
    72057594037927936, 144115188075855872, 288230376151711744, 576460752303423488,
    1152921504606846976, 2305843009213693952, 4611686018427387904, 9223372036854775808ULL};
uint64_t rshift(uint64_t source, int amount) {
    uint64_t result = 0;
    for(int i = amount; i < 64; ++i) {
        if(source & lut[i]) result |= lut[i-amount];
    }
    return result;
}
This should all be doable with just add/sub/and/or plus a lookup table.
If we want to be smarter, as @PeterCordes suggests, we can use an 8-entry lookup table and work on whole bytes, but that requires significantly more bookkeeping than looping over each bit.
--- update ---
@PeterCordes correctly notes that the lookup table is actually useless, as I'm looping over bits, so calculating the next power of two using a sum is trivial:
uint64_t rshift(uint64_t source, int amount) {
    uint64_t result = 0;
    uint64_t read_bit = 1;
    uint64_t write_bit = 1;
    for(int i = 0; i < amount; ++i) read_bit = read_bit + read_bit;
    for(int i = amount; i < 64; ++i) {
        if(source & read_bit) result |= write_bit;
        read_bit = read_bit + read_bit;
        write_bit = write_bit + write_bit;
    }
    return result;
}
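A quick sanity check of the function above (test values chosen for illustration):

#include <assert.h>

int main(void) {
    assert(rshift(0x12345, 1) == 0x91A2);            /* 0x12345 >> 1 */
    assert(rshift(1, 1) == 0);                       /* the only set bit falls off */
    assert(rshift(0x8000000000000000ULL, 63) == 1);  /* MSB moved all the way down */
    return 0;
}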
Upvotes: 1
Reputation: 365237
Like Matteo shows, you can loop one bit at a time, reading at one position and writing bits at another position.
Matteo's answer reads at a variable position by shifting a mask, and writes at a position that moves in lock-step, starting at the bottom of the register (shifting another mask).
It's easier to read the MSB of the input, then left-shift the input with add same,same and repeat. So we read bits starting with the top bit, and construct the result starting with its MSB. (We shift the result left 1 bit at a time with ADD, and use a conditional add to set the new bottom bit or not.)
We can use a 2's complement signed comparison to read the top bit of a register. If it's set, x < 0; otherwise it's not.
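Roughly, in C, the idea is the following sketch (the rshift_msb_first name is made up; the assembly below hard-codes a shift count of 1 and works on 32-bit registers):

#include <stdint.h>

/* read bits from the MSB down, shifting them into the result from the bottom */
uint64_t rshift_msb_first(uint64_t source, int amount) {
    uint64_t result = 0;
    for (int i = 0; i < 64 - amount; ++i) {
        uint64_t msb = (int64_t)source < 0;  /* signed compare reads the top bit */
        result = result + result + msb;      /* result <<= 1, then set the new low bit */
        source = source + source;            /* source <<= 1 */
    }
    return result;
}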
x86 and y86 have a flag called SF that's set according to the MSB of the result (of an ALU operation). x86 has js / cmovs / sets instructions that check the SF condition directly. y86 only has jl / jge and other signed-compare conditions that check SF!=OF, so we need to do an extra compare against zero to clear OF (x - 0 can't overflow).
Or semantically, actually compare against zero instead of just reading SF. (Except we can optimize compare-against-zero into andl %eax,%eax or andq %rax,%rax, helpful if you're on a version of y86 that doesn't have sub-immediate. y86 is also missing x86's non-destructive test and cmp instructions, which are like and and sub but only write flags.)
Porting to y86-64 should be nearly trivial. (Change reg names, and 32 becomes 64).
Test case: 0x12345 >> 1 = 0x000091a2. (I don't see a way to permalink the code on that site, the way the Godbolt compiler explorer allows.)
# constant input test case
irmovl 0x12345, %eax
# irmovl 3, %ecx # this could trivially handle variable counts, but doesn't.
# start of right-shift block:
# input: EAX = number to be shifted
# output: EDX = number >> 1
# clobbers: EAX, ECX, EDI. (EDI=1 to work around lack of add-immediate)
xorl %edx, %edx # dst = 0 (equivalent to irmovl $0, %edx)
irmovl 1, %edi # y86 is missing immediate add?
# shift 32-n bits from EAX into the bottom of EDX
# one at a time using SF to read them from the MSB
irmovl 31, %ecx # loop count = 32 - shift count; hard-coded here as 32 - 1 = 31
# or calculate this as 32 - count with neg / add or equivalent
rshift: # do {
addl %edx, %edx # dst <<= 1
andl %eax, %eax # compare against zero because y86 is missing js / cmovs that test just SF
jge MSB_zero # jge = jnl = not less
xorl %edi, %edx # edx ^= 1. y86 is missing OR? low bit = 0 so we can ADD or XOR to set it
MSB_zero:
addl %eax, %eax # src <<= 1
subl %edi, %ecx
jne rshift # }while(--ecx); # semantically jnz
halt # result in EDX
#shr $1, %eax
I used xor-zeroing because that y86 simulator assembles to variable-length machine code like x86. (So irmovl 0, %edx would be less efficient.)
Or do the carry from MSB of EAX to LSB of EDX branchlessly with CMOVL:
# loop body:
addl %edx, %edx # dst <<= 1
xorl %esi, %esi # esi = 0
subl %esi, %eax # src -= 0 to set flags
cmovl %edi, %esi # esi = (src<0) ? 1 : 0 = MSB of EAX
addl %esi, %edx # copy the bit into EDX (can't carry to higher bits)
addl %eax, %eax # src <<= 1
If your y86 simulator simulates a performance penalty for branch mispredicts, use this. Otherwise branching is fewer instructions.
Or if you care about performance, it should be possible to use a lookup table for a whole byte at a time, with fixups across byte boundaries.
But with no left-shift to efficiently assemble the separate bytes, you'd need a separate 256-entry LUT of qwords for each byte-position! Or load from an offset and then mask off the "garbage" bytes.
Oh, and you need a right shift to extract bytes from a qword to feed the array indexing. If y86 can do byte loads, then you could store the input integer to memory and reload it 1 byte at a time. Or again emulate byte loads with unaligned qword loads and AND with 0x00...0FF to isolate that byte at the bottom of a register.
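As a C sketch of the store-then-reload-at-an-offset idea (assumes a little-endian machine like x86/y86, and a shift count that is a whole number of bytes; rshift_bytes is a made-up name):

#include <stdint.h>
#include <string.h>

/* right shift by 8*byte_count: store the value, then reload 8 bytes starting
   byte_count bytes higher, with zero bytes sitting above the stored value */
uint64_t rshift_bytes(uint64_t source, int byte_count) {
    unsigned char buf[16] = {0};    /* low 8 bytes = source, high 8 bytes = zero */
    memcpy(buf, &source, 8);        /* little-endian: buf[0] is the LSB */
    uint64_t result;
    memcpy(&result, buf + byte_count, 8);
    return result;
}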
Oh crap, but we have a chicken/egg problem for runtime-variable counts. We need count / 8 as a byte offset, because there are 8 bits in a byte. But count is small, so we could use a repeated-subtraction loop for it. (You might want to AND with 0x3f or 0x1f (depending on operand-size) to wrap the count at 64 or 32, like x86 hardware shifts do. This will avoid indexing memory way outside the correct range with counts too large.)
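A sketch of that count split in C, using only AND and repeated subtraction (split_count is a made-up name):

/* split a runtime shift count into whole bytes plus leftover bits,
   using only AND and repeated subtraction (no division) */
void split_count(int count, int *byte_count, int *bit_count) {
    int bits = count & 63;      /* wrap the count at 64, like hardware shifts do */
    int bytes = 0;
    while (bits >= 8) {         /* repeated subtraction stands in for count / 8 */
        bits -= 8;
        ++bytes;
    }
    *byte_count = bytes;        /* == count / 8 */
    *bit_count  = bits;         /* == count % 8 */
}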
Anyway, then you could extend this to handle right-shift counts that are not a multiple of 8 by rounding up (shifting out too many bits), then put the needed bits back in one at a time like the loop in the first part of this answer. (After an unaligned load to get those bits at the top of a register.)
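A hedged sketch of how the pieces could fit together; note that this rounds the byte shift down and finishes the leftover 0-7 bits with the bit-at-a-time loop, rather than rounding up and re-inserting bits as described above (rshift_bytes and split_count are the made-up helpers sketched earlier, and rshift is Matteo's bit-loop version):

#include <stdint.h>

uint64_t rshift_any(uint64_t source, int count) {
    int bytes, bits;
    split_count(count, &bytes, &bits);               /* count / 8 and count % 8 */
    uint64_t partial = rshift_bytes(source, bytes);  /* byte-granularity shift first */
    return rshift(partial, bits);                    /* finish the last 0-7 bits bit by bit */
}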
Or maybe use Matteo's method of walking a mask, using a LUT for start points. But if we're already doing a store/unaligned reload for byte shifting, another reload is probably good. We can calculate the right offset for this relative to the first unaligned reload: 4 or 8 bytes earlier, so that the starting MSB is the bit just below the lowest bit of the first load.
Upvotes: 2