Reputation: 15030
According to articles like this, half of the floating-point numbers are in the interval [-1,1]. Could you suggest how to make use of this fact so to replace the naive conversion of a 32-bit unsigned integer into a floating-point number (while keeping the uniform distribution)?
Naive code:
uint32_t i = /* randomly generated */;
float f = (float)i / (1ui32<<31) - 1.0f;
The problem here is that first the number i
is converted into float
losing up to 8 lower bits of precision. Only then the number is scaled to [0;2) interval, and then to [-1;1) interval.
Please, suggest the solution in C or C++ for x86_64 CPU or CUDA if you know it.
Update: the solution with a double
is good for x86_64, but is too slow in CUDA. Sorry I didn't expect such a response. Any ideas how to achieve this without using double-precision floating-point?
Upvotes: 3
Views: 3399
Reputation: 29952
The thing to realize is while (float)i
does lose 8-bit of precision (so it has 24 bits of precision), the result only has 24 bits of precision as well. So this precision loss is not necessarily a bad thing (this is actually more complicated, because if i
is smaller, it will lose less than 8-bits. But things will work out well).
So we just need to fix the range, so the originally non-negative value gets mapped to INT_MIN..INT_MAX
.
This expression works: (float)(int)(value^0x80000000)/0x80000000
.
Here's how it works:
(int)(value^0x80000000)
part flips the sign bit, so 0x0
gets mapped to INT_MIN
, and 0xffffffff
gets mapped to INT_MAX
.float
. This is where some rounding happens, and we lose precision (but it is not a problem).0x80000000
to get into the range [-1..1]
. As this division just adjusts the exponent part, this division doesn't lose any precision.So, there is only one rounding, the other operations doesn't lose precision. These chain of operations should have the same effect, as calculating the result in infinite precision, then doing the rounding to float
(this theoretical rounding has the same effect as the rounding at the 2. step)
But, to be absolutely sure, I've verified with brute force checking all the 32-bit values that this expression results in the same value as (float)((double)value/0x80000000-1.0)
.
Upvotes: 1
Reputation: 153457
Any ideas how to achieve this without using double-precision floating-point?
Without assuming too much about the insides of float
:
Shift u
until the most significant bit is set, halving the float
conversion value.
"keeping the uniform distribution"
50% of the uint32_t
values will be in the [0.5 ... 1.0)
25% of the uint32_t
values will be in the [0.25 ... 0.5)
12.5% of the uint32_t
values will be in the [0.125 ... 0.25)
6.25% of the uint32_t
values will be in the [0.0625 ... 0.125)
...
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
float ui32to0to1(uint32_t u) {
if (u) {
float band = 1.0f/(1llu<<32);
while ((u & 0x80000000) == 0) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
return 0.0f;
}
Some test code to show functional equivalence to double
.
int test(uint32_t u) {
volatile float f0 = (float) ((double)u / (1llu<<32));
volatile float f1 = ui32to0to1(u);
if (f0 != f1) {
printf("%8lX %.7e %.7e\n", (unsigned long) u, f0, f1);
return 1;
}
return 0;
}
int main(void) {
for (int i=0; i<100000000; i++) {
test(rand()*65535u ^ rand());
}
return 0;
}
Various optimizations are possible, especially with assuming properties of float
. Yet for an initial answer, I'll stick to a general approach.
For improved efficiency, the loop needs only to iterate from 32 down to FLT_MANT_DIG
which is usually 24.
float ui32to0to1(uint32_t u) {
float band = 1.0f/(1llu<<32);
for (int i = 32; (i>FLT_MANT_DIG && ((u & 0x80000000) == 0)); i--) {
u <<= 1;
band *= 0.5f;
}
return (float)u * band;
}
This answers maps [0 to 232-1] to [0.0 to 1.0)
To map to [0 to 232-1] to (-1.0 to 1.0). It can form -0.0.
if (u >= 0x80000000) {
return ui32to0to1((u - 0x80000000)*2);
} else
return -ui32to0to1((0x7FFFFFFF - u)*2);
}
Upvotes: 0
Reputation: 51845
In case you drop the uniform distribution constraint its doable on 32bit integer arithmetics alone:
//---------------------------------------------------------------------------
float i32_to_f32(int x)
{
int exp;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0x00000000) return 0.0f;
if (x< -0x1FFFFFFF) return -1.0f;
if (x> +0x1FFFFFFF) return +1.0f;
// conversion
y.u=0; // reset bits
if (x<0){ y.u|=0x80000000; x=-x; } // sign (31 bits left)
exp=((x>>23)&63)-64; // upper 6 bits -> exponent -1,...,-64 (not 7bits to avoid denormalized numbers)
y.u|=(exp+127)<<23; // exponent bias and bit position
y.u|=x&0x007FFFFF; // mantissa
return y.f;
}
//---------------------------------------------------------------------------
int f32_to_i32(float x)
{
int exp,man,i;
union _f32 // semi result
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
} y;
// edge cases
if (x== 0.0f) return 0x00000000;
if (x<=-1.0f) return -0x1FFFFFFF;
if (x>=+1.0f) return +0x1FFFFFFF;
// conversion
y.f=x;
exp=(y.u>>23)&255; exp-=127; // exponent bias and bit position
if (exp<-64) return 0.0f;
man=y.u&0x007FFFFF; // mantissa
i =(exp<<23)&0x1F800000;
i|= man;
if (y.u>=0x80000000) i=-i; // sign
return i;
}
//---------------------------------------------------------------------------
I chose to use only 29 bits + sign = ~ 30 bits of integer to avoid denormalized numbers havoc which I am too lazy to encode (it would get you 30 or even 31 bits but much slower and complicated).
But the distribution is not linear nor uniform at all:
in Red is the float
in range <-1,+1>
and Blue is integer
in range <-1FFFFFFF,+1FFFFFFF>
.
On the other hand there is no rounding at all in both conversions ...
PS. I think there might be a way to somewhat linearize the result by using a precomputed LUT for the 6 bit exponent (64 values).
Upvotes: 2
Reputation:
I suggest (if yout want to avoid division and use an accurately float-representable start value of 1.0*2^-32):
float e = i * ldexp(1.0,-32) - 1.0;
Upvotes: 0
Reputation: 223872
You can do the calculation using double
instead so you don't lose any precision on the uint32_t
value, then assign the result to a float
.
float f = (double)i / (1ui32<<31) - 1.0;
Upvotes: 2