Reputation: 15
I want to calculate a sequence of exponents of base 2. for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0). The problem is I cant calculate the exponent to being with.
I have tried shl
to calculate the exponent and I have tried a loop. Neither produce the correct result, my latest attempt is below.
.586
.MODEL FLAT
INCLUDE io.h
.STACK 4096
.DATA
Exponent DWORD ?
Prompt BYTE "Enter an exponent", 0
Base DWORD 2
string BYTE 40 DUP (?)
resultLbl BYTE "The solution is", 0
Solution DWORD 20 DUP (?), 0
.CODE
_MainProc PROC
input Prompt, string, 40 ; read ASCII characters (N)
atod string ; ascii to double
mov exponent, eax ; store in memory
push Exponent
push Base
call function
add esp, 8
dtoa Solution, eax ; convert to ASCII characters
output resultLbl, Solution ; output label and sum
mov eax, 0 ; exit with return code 0
ret
_MainProc ENDP
;@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
; Procedure takes base and exponent from main
Function PROC
push ebp ; stores base pointer
mov ebp, esp ; set base pointer to current position
mov ebx, [ebp + 12] ; read base (2)
mov ecx, [ebp + 8] ; read N
MainLoop: ; loop to calculate expoent
mul ebx ; multiply base by its self
add eax, ebx ; save answer in eax
dec ecx ; decrement the exponent
cmp ecx, 0 ; check if exponent is 0 yet
je exit ; Exit if exponent is 0
jg mainloop ; stay in loop if exponent is above 0
Exit:
pop ebp ; restore base pointer
ret
function ENDP
END
I want the program to produce the correct exponent result, and if you can calculate the sequence above, for example if given 3 the program would calculate (2^3 + 2^2 + 2^1 + 2^0).
Upvotes: 0
Views: 1450
Reputation: 365147
Base 2 is an extremely special case because computers use binary integers.
2^n = 1 << n
i.e. 2n = 1 left-shifted n
times. i.e. that bit-position is set.
You're asking for 2^0 + 2^1 + ... 2^n
. A number with all bits set up to an including bit n
.
That's 1 less than 2^(n+1)
, so the number you want is 2^(n+1) - 1
.
x86 has an instruction to set a bit at a given position in a register: bts reg, reg
= Bit Test and Set. On Intel CPUs, the reg-reg form decodes to a single uop, with 1-cycle latency. (https://agner.org/optimize/) It also puts the old value of the bit in CF, but normally we don't care about that when using it for this. Normally we just zero a destination register and use BTS to set a single bit.
On Ryzen, bts reg,reg
is 2 uops and variable-count shifts are single-uop, so mov edx, 1
/ shl edx, cl
is good, too. But that's worse for code-size, and more uops on Intel CPUs.
On Intel Sandybridge-family CPUs, shl reg, cl
decodes to 3 uops. BMI2 shlx reg, reg, reg
is 1 uop, but BMI2 support is still far from being something you can assume, unfortunately.
(Memory-destination BTS with a register source is very slow, treating the source as an indexing into an arbitrary length bit-string, not just the addressed dword, so normally avoid it for performance.)
2^(n+1) - 1
without overflow(C version of this answer on another question, with compiler output.)
If we just added 1 to the input number before feeding it to BTS, it could wrap around and fail to work for 31
(which should set all bits, but would instead set bit 32%32 = 0
)
Since we need one extra instruction after reading the input anyway, we can use x86's shift-and-add instruction (LEA) to do one more shift as we subtract 1. So with n=31
we start with the high bit set, and shift it out to get zero. Subtracting 1 then sets all the bits, as desired
xor edx,edx ; edx = 0
bts edx, eax ; edx |= 1<<eax
lea edx, [edx + edx - 1] ; edx = (edx<<1) - 1
The logic for this goes as follows
n BTS result (2^n) *2 - 1
0 -> 1 -> 1 = 2 - 1
1 -> 2 -> 3 = 4 - 1
2 -> 4 -> 7 = 8 - 1
3 -> 8 -> 15 = 16 - 1
...
30 -> 0x40000000 -> 0x7FFFFFFF = 0x80000000 - 1
31 -> 0x80000000 -> 0xFFFFFFFF = 0 - 1
It's not a coincidence that the last column is a running total of the 2nd column.
LEA with 3 "components" to the addressing mode has extra latency vs. simpler LEAs, e.g. 3-cycle latency on Sandybridge-family vs. 1 cycle. But it's still a single uop so it's a great choice for throughput.
If we really wanted to optimize, and didn't have to worry about the n=31
case overflowing, we'd write the ASCII -> int loop by hand but start with a total of 1
instead of 0
, to fold the n+1
into finding n
. Then bts
would give us our 2^(n+1)
, and we could simply dec
that.
There's no need to store exponent
to memory and reload it again; you already have it in a register where you want it.
Your comment on the atod
line is wrong, and your code wouldn't make sense if it was right. ASCII to double
(like the C function strtod
) would return in x87 st0
or SSE2 xmm0
, not EAX. atod
actually stands for ASCII (decimal) to integer. Perhaps the d
stands for DWORD.
input Prompt, string, 40 ; read ASCII characters (N)
atod string ; ascii (decimal) to integer in EAX
xor edx,edx ; edx = 0
bts edx, eax ; edx |= 1<<eax
lea edx, [edx + edx - 1] ; edx = (edx<<1) - 1
dtoa Solution, edx
It takes 2 instructions to implement 1<<n
so it's silly to put it into its own function. Just passing args + calling the function would have take as much work as simply using bts
.
0 -> 1 -> 1
1 -> 2 -> 3
2 -> 4 -> 7
Compilers typically use mov edx,1
+ mov ecx, eax
+ shl edx, cl
. But that's a missed optimization, especially on Intel CPUs.
BMI2 shlx edx, edx, ecx
would help (avoiding the mov
), but is worse code-size than xor
-zero + bts
. And BMI2 isn't always available.
If you can't get your compiler to use BTS (or you have BMI2 for SHLX), another good implementation is (2ULL << n) - 1
so you bake the extra shift into the number you start shifting. So a count of 31
can shift the bit out and produce 0.
Upvotes: 4