x86 Assembly (SSE): Unexpected Multiplication Result

Question

The following code should quantize a positive single-precision floating-point number to a 32-bit integer. As the positive range contains only 2^31 - 1 discrete levels, the code multiplies the sample by this value and rounds the result to an integer:

mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
cvtsi2ss xmm1, eax    // convert eax to float --> xmm1
movss xmm0, [sample]  // where 'sample' is of type float
mulss xmm0, xmm1      // Get sample's quantum into xmm0
cvtss2si eax, xmm0    // Round quantum to the nearest integer --> eax

The problem is that for a sample value of 1.0f, the end result (eax value) is 0x80000000 = 2^31, which is out of range. The expected result would be 1.0 x (2^31 - 1) = (2^31 - 1) = 0x7FFFFFFF.

Moreover, this value is actually the 2's complement representation of -2^31 (note the minus sign).

What am I missing here?

(MSVC 2010 is being used for the testing.)

Michael Petch · Accepted Answer

You move 2³¹-1 to EAX and convert it from a 32-bit integer to a single (32-bit) scalar float.

mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
cvtsi2ss xmm1, eax    // convert eax to float --> xmm1

The problem is that there isn't enough mantissa in an IEEE754 32-bit float to represent 2³¹-1 exactly. It actually gets rounded up to 2.147483648E9, which is 2³¹. An online IEEE754 binary converter can show how this occurs: the integer 2³¹-1 converts to the single scalar float 2.147483648E9.

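Exactly representing every integer from 0 to 2³¹-1 takes 31 bits. A 32-bit IEEE float (with a 23 + 1 implicit bit mantissa) can exactly represent every integer with magnitude up to 2²⁴. Beyond that, only multiples of progressively larger powers of 2 are exactly representable; just below 2³¹ the representable values are 128 apart, so 2³¹-1 rounds to the nearest one, which is 2³¹.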
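The rounding can be checked from C without any assembly. The following is a minimal sketch; the helper names `to_float` and `is_exact_in_float` are made up for illustration, and it assumes the default round-to-nearest mode.

```c
#include <stdint.h>

/* Convert a 32-bit integer to single precision.
   The volatile store forces the value to be rounded to a real 32-bit float
   even on targets that keep intermediates in wider registers. */
float to_float(int32_t v)
{
    volatile float f = (float)v;
    return f;
}

/* Returns 1 if v survives the conversion to float unchanged, 0 otherwise.
   Both sides are widened to double, which holds every int32_t exactly. */
int is_exact_in_float(int32_t v)
{
    return (double)to_float(v) == (double)v;
}
```

With these helpers, `to_float(INT32_MAX)` compares equal to `2147483648.0f`, and `is_exact_in_float` is true for 16777216 (2²⁴) but false for 16777217.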

It's provable (with information theory) that it's impossible to design a 31-bit encoding that can exactly represent all the integers from 0 to 2³¹-1 and also represent any other values. The integers use up all the coding space. If such a thing were possible, you could use the technique repeatedly to compress all the world's data into one bit.

The 0x80000000 result is how cvtss2si and cvtsd2si signal overflow. From the Intel instruction set reference manual (see the x86 wiki for links):

If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised, and if this exception is masked, the indefinite integer value (80000000H) is returned.

It's nothing to do with integer wraparound, or the float value being one past the exact result.
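Because the overflow is reported only through the 0x80000000 result (or an unmasked exception), one option is to range-check the float before converting it. The sketch below is one way to do that in portable C, where converting an out-of-range float to an integer is undefined behaviour. The function name `float_to_int32_checked` is hypothetical, and note that the C cast truncates toward zero, whereas cvtss2si rounds according to MXCSR.

```c
#include <stdint.h>

/* Hypothetical helper: stores the converted value in *out and returns 1
   if x fits in an int32_t; returns 0 (leaving *out untouched) otherwise. */
int float_to_int32_checked(float x, int32_t *out)
{
    /* 2147483648.0f is exactly 2^31 and -2147483648.0f is exactly -2^31.
       Writing the test this way also rejects NaN, since every comparison
       with NaN is false. */
    if (!(x >= -2147483648.0f && x < 2147483648.0f))
        return 0;

    /* In range: the cast is well defined. It truncates toward zero. */
    *out = (int32_t)x;
    return 1;
}
```

For the question's input, the product 1.0f * 2147483648.0f fails the check, which matches the overflow that cvtss2si reports.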

Note that with a 64-bit integer register, cvtss2si rax, xmm1 can produce results up to 0x7fffff8000000000, with larger floats producing the 0x8000000000000000 "indefinite value". This is contrary to the text description in Intel's manual, where they forgot to update the max-value paragraph for 64-bit operand-size to match what cvtsd2si says. The largest integer you can convert to single-precision float and back without producing an overflow is 0x7fffffbfffffffff, which rounds down to 0x7fffff8000000000.
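The 64-bit boundary values can be checked the same way. This is a small sketch assuming the default round-to-nearest-even mode; `int64_to_float` is a name chosen for illustration, and the constants use C99 hexadecimal float literals.

```c
#include <stdint.h>

/* Convert a 64-bit integer to single precision.
   The volatile store forces rounding to a real 32-bit float. */
float int64_to_float(int64_t v)
{
    volatile float f = (float)v;
    return f;
}

/* 0x1.fffffep62f is 2^63 - 2^39, the largest float below 2^63.
   As an integer it is 0x7fffff8000000000.
   0x1p63f is 2^63, which no longer fits in a signed 64-bit integer. */
```

Under those assumptions, `int64_to_float(0x7fffffbfffffffffLL)` equals `0x1.fffffep62f`, while one step higher, `int64_to_float(0x7fffffc000000000LL)`, is an exact tie that rounds to even and yields `0x1p63f`.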

If you use a double scalar there is enough mantissa to represent 2³¹-1 exactly. The same kind of online converter shows the integer 2³¹-1 converting to the double scalar float 2.147483647E9.

As Jester pointed out, using double (64-bit) scalar floats would rectify your problem. That code could look something like:

double sample = 1.0f;

__asm
{
    mov eax, 0x7FFFFFFF   // eax = 2^31 - 1
    cvtsi2sd xmm1, eax    // convert eax to double float --> xmm1
    movsd xmm0, [sample]  // where 'sample' is of type double float
    mulsd xmm0, xmm1      // Get sample's quantum into xmm0
    cvtsd2si eax, xmm0    // Round quantum to the nearest integer --> eax
}

If you wanted to keep sample as a 32-bit float instead of the double in my example, you could replace movsd xmm0, [sample] with cvtss2sd xmm0, [sample].
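For comparison, the same idea can be written in plain C by doing the multiplication in double. This is only a sketch: the function name `quantize` and the [0, 1] input range are assumptions, and it rounds halves upward by adding 0.5 and truncating, whereas cvtsd2si rounds to nearest-even under the default MXCSR setting.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C counterpart of the inline assembly above:
   scale a sample in [0, 1] to the range 0 .. 2^31 - 1. */
int32_t quantize(float sample)
{
    /* Assumed precondition; outside this range the cast below could overflow. */
    assert(sample >= 0.0f && sample <= 1.0f);

    /* 2147483647.0 is exact in double (53-bit mantissa), unlike in float. */
    double q = (double)sample * 2147483647.0;

    /* Round to nearest for non-negative q: add 0.5, then truncate. */
    return (int32_t)(q + 0.5);
}
```

With this version, `quantize(1.0f)` gives 2147483647 (0x7FFFFFFF), which is the result the question expected.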

Given that this answer is based upon the input of multiple contributors, I've marked this as community wiki, so feel free to edit.
