Reputation: 438
The following code should quantize a positive (single-precision) floating-point number to a 32-bit integer.
As the positive range contains only 2^31 - 1 (discrete) levels, the code multiplies the sample by this value and rounds the result to an integer:
mov eax, 0x7FFFFFFF // eax = 2^31 - 1
cvtsi2ss xmm1, eax // convert eax to float --> xmm1
movss xmm0, [sample] // where 'sample' is of type float
mulss xmm0, xmm1 // Get sample's quantum into xmm0
cvtss2si eax, xmm0 // Round quantum to the nearest integer --> eax
The problem is that for a sample value of 1.0f, the end result (eax value) is 0x80000000 = 2^31, which is out of range.
The expected result would be 1.0 x (2^31 - 1) = (2^31 - 1) = 0x7FFFFFFF.
Moreover, this value is actually the 2's complement representation of -2^31 (note the minus sign).
What am I missing here?
(MSVC2010 is being used for testing.)
Upvotes: 1
Views: 485
Reputation: 47573
You move 2^31-1 to EAX and convert it from a 32-bit integer to a single (32-bit) scalar float.
mov eax, 0x7FFFFFFF // eax = 2^31 - 1
cvtsi2ss xmm1, eax // convert eax to float --> xmm1
The problem is that there isn't enough mantissa in an IEEE754 32-bit float to represent 2^31-1 accurately. It actually gets rounded up to 2.147483648E9, which is exactly 2^31. There is an online binary converter that can better describe how this occurred. The conversion of integer 2^31-1 to a single scalar float 2.147483648E9 is demonstrated here.
Exactly representing every integer from 0 to 2^31-1 takes 31 bits. A 32-bit IEEE float (with a 23 + 1 implicit bit mantissa) can exactly represent every integer with magnitude up to 2^24. Outside that range, only some integers are exactly representable, powers of 2 among them.
It's provable (with information theory) that it's impossible to design a 31-bit encoding that can exactly represent all the integers from 0 to 2^31-1 and also be able to represent any other values. The integers use up all the coding space. If such a thing were possible, you could use the technique repeatedly to compress all the world's data into one bit.
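To make the 2^24 limit concrete, here is a minimal C sketch. The helper names `survives_float` and `as_float` are made up for illustration and are not from the question; the comparison is done in double, which can hold every 32-bit integer exactly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if v is unchanged by a conversion to float.
   The check is done in double, which represents every int32_t exactly. */
bool survives_float(int32_t v)
{
    volatile float f = (float)v;  /* volatile forces a true single-precision value */
    return (double)f == (double)v;
}

/* Hypothetical helper: the value v really becomes as a float, widened to double. */
double as_float(int32_t v)
{
    volatile float f = (float)v;
    return (double)f;
}
```

With these, `survives_float(16777216)` (2^24) holds, `survives_float(16777217)` does not, and `as_float(INT32_MAX)` compares equal to 2147483648.0, i.e. 2^31 rather than 2^31-1.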
The 0x80000000 result is how cvtss2si and cvtsd2si signal overflow. From the Intel instruction set reference manual (see the x86 wiki for links):
If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.
It's nothing to do with integer wraparound, or the float value being one past the exact result.
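The overflow behaviour can be observed from C through the SSE intrinsics, which is a rough sketch of what the asm does rather than the asker's exact code. It assumes an x86 or x86-64 target with SSE and the default (masked) floating-point exception state; `cvtss2si_raw` is a hypothetical wrapper name.

```c
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics; x86 / x86-64 only (assumption) */

/* Hypothetical wrapper: run cvtss2si on x and return the raw 32-bit result. */
int32_t cvtss2si_raw(float x)
{
    volatile float v = x;  /* discourage compile-time evaluation */
    return (int32_t)_mm_cvtss_si32(_mm_set_ss(v));
}
```

Under those assumptions, `cvtss2si_raw(2147483648.0f)` and `cvtss2si_raw(1.0e20f)` both give 0x80000000 (`INT32_MIN`), while 2147483520.0f, the largest float below 2^31, converts to 2147483520 as expected.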
Note that with a 64-bit integer register, cvtss2si rax, xmm1 can produce results up to 0x7fffff8000000000, with larger floats producing the 0x8000000000000000 "indefinite value". This is contrary to the text description in Intel's manual, where they forgot to update the max-value paragraph for 64-bit operand-size to match what cvtsd2si says. The largest integer you can round-trip to single-precision float without producing an overflow is 0x7fffffbfffffffff.
If you use a double scalar there is enough mantissa to accurately represent 2^31-1. The conversion of integer 2^31-1 to a double scalar float 2.147483647E9 is demonstrated here.
As Jester pointed out, using double (64-bit) scalar floats would rectify your problem. That code could look something like:
double sample = 1.0f;
__asm
{
mov eax, 0x7FFFFFFF // eax = 2^31 - 1
cvtsi2sd xmm1, eax // convert eax to double float --> xmm1
movsd xmm0, [sample] // where 'sample' is of type double float
mulsd xmm0, xmm1 // Get sample's quantum into xmm0
cvtsd2si eax, xmm0 // Round quantum to the nearest integer --> eax
}
If you wanted to keep sample as a 32-bit float instead of the double in my example, you could replace movsd xmm0, [sample] with cvtss2sd xmm0, [sample].
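For comparison, the same idea in plain C might look like the sketch below: the float sample is widened to double before scaling, mirroring the cvtss2sd variant. The function name `quantize` is hypothetical, the sample is assumed to lie in [0.0, 1.0], and rounding is done by adding 0.5 and truncating, which rounds exact ties up instead of to even as cvtsd2si does by default.

```c
#include <stdint.h>

/* Hypothetical helper: quantize a sample in [0.0, 1.0] to 0 .. 2^31-1.
   Assumption: the caller guarantees the input range. */
int32_t quantize(float sample)
{
    double scale = 2147483647.0;          /* 2^31 - 1, exact in double */
    double q = (double)sample * scale;    /* widen first, then scale */
    return (int32_t)(q + 0.5);            /* round half up (non-negative q only) */
}
```

Here `quantize(1.0f)` yields 0x7FFFFFFF, because the scale factor is never rounded to 2^31 on the way in.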
Given that this answer is based upon the input of multiple contributors, I've marked this as community wiki, so feel free to edit.
Upvotes: 3