Taylor Bishop

Reputation: 519

How many values can be represented in a given range by a float?

Intuition tells me that since 32 bits can represent a fixed number of different values, a float can represent only a fixed number of values in any given range. Is this true? Is there any loss in the number of values that can be represented due to the way the conversion is handled?

Say I pick a number in the range [10^30, 10^35]. Obviously the precision I can get within this range is limited, but are there any differences in the number of values that can be represented in this range compared to a more reasonable range like [0.0, 1000.0]?

Upvotes: 3

Views: 3461

Answers (5)

kvantour

Reputation: 26481

Maybe I am overlooking something here, but looking at the bit-pattern of IEEE-754 binary32

[Figure: IEEE-754 binary32 bit layout: 1 sign bit (b31), 8 exponent bits (b30..b23), 23 fraction bits (b22..b0)]

And you know that it is decoded as:

(-1)^b31 × (1 + Sum(b_i × 2^(i-23); i = 22 ... 0)) × 2^(e-127)

Then you see that the lowest exponent field is 0 and the highest is 255. If you multiply the whole number by 2^127, then you see that the ordering of two floating-point numbers with the same fraction is defined by the order of the exponent e, which is an integer. So if you want to sort IEEE-754 binary32 numbers from low to high, you

  • sort first on the sign,
  • second on the exponent,
  • third on the fraction.

This actually means that, for non-negative numbers, the order of the floating-point numbers is identical to the order of the corresponding integers created by the same bit pattern. So if you want to know the distance between two such floating-point numbers, you only need to subtract the corresponding integers (this assumes that +0 and -0 are treated equally):

/* count the binary32 numbers in the half-open interval [start, stop);
   assumes non-negative operands; note that copying the bits into an int
   with memcpy is the portable alternative to this pointer cast, which is
   formally undefined behavior */
int distance (float start, float stop)
{
    return *(reinterpret_cast<int *>(&stop)) - *(reinterpret_cast<int *>(&start));
}

Image is taken from Wikipedia: https://en.wikipedia.org/wiki/Single-precision_floating-point_format

Upvotes: 3

Eric Postpischil

Reputation: 222660

Here is code that calculates the number of values representable in float in all finite ranges. It expects IEEE-754 arithmetic. I adapted it from my previous C++ answer.

This has two implementations of converting a floating-point number to its encoding (one by copying the bits, one by mathematically manipulating it). After that, the distance calculation is fairly simple (negative values have to be adjusted, and then the distance is simply a subtraction).

#include <float.h>
#include <inttypes.h>
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tgmath.h>


/*  Define a value with only the high bit of a uint32_t set.  This is also the
    encoding of floating-point -0.
*/
static const uint32_t HighBit = UINT32_MAX ^ UINT32_MAX>>1;


//  Return the encoding of a floating-point number by copying its bits.
static uint32_t EncodingBits(float x)
{
    uint32_t result;
    memcpy(&result, &x, sizeof result);
    return result;
}


//  Return the encoding of a floating-point number by using math.
static uint32_t EncodingMath(float x)
{
    static const int SignificandBits = FLT_MANT_DIG;
    static const int MinimumExponent = FLT_MIN_EXP;

    //  Encode the high bit.
    uint32_t result = signbit(x) ? HighBit : 0;

    //  If the value is zero, the remaining bits are zero, so we are done.
    if (x == 0) return result;

    /*  The C library provides a little-known routine to split a floating-point
        number into a significand and an exponent.  Note that this produces a
        normalized significand, not the actual significand encoding.  Notably,
        it brings significands of subnormals up to at least 1/2.  We will
        adjust for that below.  Also, this routine normalizes to [1/2, 1),
        whereas IEEE 754 is usually expressed with [1, 2), but that does not
        bother us.
    */
    int xe;
    float xf = frexp(fabs(x), &xe);

    //  Test whether the number is subnormal.
    if (xe < MinimumExponent)
    {
        /*  For a subnormal value, the exponent encoding is zero, so we only
            have to insert the significand bits.  This scales the significand
            so that its low bit is scaled to the 1 position and then inserts it
            into the encoding.
        */
        result |= (uint32_t) ldexp(xf, xe - MinimumExponent + SignificandBits);
    }
    else
    {
        /*  For a normal value, the significand is encoded without its leading
            bit.  So we subtract .5 to remove that bit and then scale the
            significand so its low bit is scaled to the 1 position.
        */
        result |= (uint32_t) ldexp(xf - .5, SignificandBits);

        /*  The exponent is encoded with a bias of (in C++'s terminology)
            MinimumExponent - 1.  So we subtract that to get the exponent
            encoding and then shift it to the position of the exponent field.
            Then we insert it into the encoding.
        */
        result |= ((uint32_t) xe - MinimumExponent + 1) << (SignificandBits-1);
    }

    return result;
}


/*  Return the encoding of a floating-point number.  For illustration, we
    get the encoding with two different methods and compare the results.
*/
static uint32_t Encoding(float x)
{
    uint32_t xb = EncodingBits(x);
    uint32_t xm = EncodingMath(x);

    if (xb != xm)
    {
        fprintf(stderr, "Internal error encoding %.99g.\n", x);
        fprintf(stderr, "\tEncodingBits says %#" PRIx32 ".\n", xb);
        fprintf(stderr, "\tEncodingMath says %#" PRIx32 ".\n", xm);
        exit(EXIT_FAILURE);
    }

    return xb;
}


/*  Return the distance from a to b as the number of values representable in
    float from one to the other.  b must be greater than or equal to a.  0 is
    counted only once.
*/
static uint32_t Distance(float a, float b)
{
    uint32_t ae = Encoding(a);
    uint32_t be = Encoding(b);

    /*  For represented values from +0 to infinity, the IEEE 754 binary
        floating-points are in ascending order and are consecutive.  So we can
        simply subtract two encodings to get the number of representable values
        between them (including one endpoint but not the other).

        Unfortunately, the negative numbers are not adjacent and run the other
        direction.  To deal with this, if the number is negative, we transform
        its encoding by subtracting from the encoding of -0.  This gives us a
        consecutive sequence of encodings from the greatest magnitude finite
        negative number to the greatest finite number, in ascending order
        except for wrapping at the maximum uint32_t value.

        Note that this also maps the encoding of -0 to 0 (the encoding of +0),
        so the two zeroes become one point, so they are counted only once.
    */
    if (HighBit & ae) ae = HighBit - ae;
    if (HighBit & be) be = HighBit - be;

    //  Return the distance between the two transformed encodings.
    return be - ae;
}


static void Try(float a, float b)
{
    printf("[%.99g, %.99g] contains %" PRIu32 " representable values.\n",
        a, b, Distance(a, b) + 1);
}


int main(void)
{
    if (sizeof(float) != sizeof(uint32_t))
    {
        fprintf(stderr, "Error, uint32_t must be the same size as float.\n");
        exit(EXIT_FAILURE);
    }

    /*  Prepare some test values:  smallest positive (subnormal) value, largest
        subnormal value, smallest normal value.
    */
    float S1 = FLT_TRUE_MIN;
    float N1 = FLT_MIN;
    float S2 = N1 - S1;

    //  Test 0 <= a <= b.
    Try( 0,  0);
    Try( 0, S1);
    Try( 0, S2);
    Try( 0, N1);
    Try( 0, 1./3);
    Try(S1, S1);
    Try(S1, S2);
    Try(S1, N1);
    Try(S1, 1./3);
    Try(S2, S2);
    Try(S2, N1);
    Try(S2, 1./3);
    Try(N1, N1);
    Try(N1, 1./3);

    //  Test a <= b <= 0.
    Try(-0., -0.);
    Try(-S1, -0.);
    Try(-S2, -0.);
    Try(-N1, -0.);
    Try(-1./3, -0.);
    Try(-S1, -S1);
    Try(-S2, -S1);
    Try(-N1, -S1);
    Try(-1./3, -S1);
    Try(-S2, -S2);
    Try(-N1, -S2);
    Try(-1./3, -S2);
    Try(-N1, -N1);
    Try(-1./3, -N1);

    //  Test a <= 0 <= b.
    Try(-0., +0.);
    Try(-0., S1);
    Try(-0., S2);
    Try(-0., N1);
    Try(-0., 1./3);
    Try(-S1, +0.);
    Try(-S1, S1);
    Try(-S1, S2);
    Try(-S1, N1);
    Try(-S1, 1./3);
    Try(-S2, +0.);
    Try(-S2, S1);
    Try(-S2, S2);
    Try(-S2, N1);
    Try(-S2, 1./3);
    Try(-N1, +0.);
    Try(-N1, S1);
    Try(-N1, S2);
    Try(-N1, N1);
    Try(-N1, 1./3);
    Try(-1./3, +0.);
    Try(-1./3, S1);
    Try(-1./3, S2);
    Try(-1./3, N1);
    Try(-1./3, 1./3);

    return 0;
}

Upvotes: 1

chux

Reputation: 153456

How many values can be represented in a given range by a float?

... since 32 bits can represent a fixed number of different values that a float could represent a fixed number of values for any given range. Is this true?

Yes - true. Over the entire typical float range, about 2^32 different values can be represented.

Is there any loss to the number of values able to be represented by the way the conversion is handled?

Non sequitur. float does not define how other numeric representations are converted to/from float. printf(), scanf(), atof(), strtof(), (float) some_integer, (some_integer_type) some_float and the compiler itself all perform conversions. C is loose on how well the conversion must occur; quality libraries and compilers are expected to perform as well as possible. In the case of source-code constants or "string" numbers like "1.2345", infinitely many possible values are mapped onto about 2^32 different ones. Yes, a loss occurs.

... in the range [10^30, 10^35]. ... are there any differences in the number of values that can be represented in this range compared to a more reasonable range like [0.0, 1000.0]?

Yes. float values are distributed logarithmically, not linearly. Between [10^30, 10^35] there are about as many different float values as between [10^27, 10^32] or [10^-3, 10^2]: scaling a range by a power of two leaves the count unchanged, and scaling by a power of ten changes it only slightly. About 25% of all float values lie in the range [0.0 ... 1.0], so there are many times more values in [0.0, 1000.0] than in [10^30, 10^35].

Upvotes: 1

Eric Postpischil

Reputation: 222660

This is contributed for information purposes. It could be used to provide more readily usable information, such as code to compute the count, samples for various ranges, or discussion, but I am out of time and want to preserve the information so far.

For IEEE-754 basic 32-bit binary floating-point, the number N(x) of non-negative representable values less than or equal to a non-negative x is:

  • 2^23·255 if 2^128 ≤ x.
  • 2^23·(floor(log2(x)) + 127) + floor(x / 2^(floor(log2(x)) - 23)) - 2^23 + 1 if 2^-126 ≤ x < 2^128.
  • floor(x / 2^(-126-23)) + 1 if x < 2^-126.

So the number of representable values x satisfying a < x ≤ b is N(b) - N(a).

Explanation:

  • In the first case, 2^23·255 is the number of representable non-negative finite values: 2^23 for each of the 255 finite exponent-field values, including subnormals and zero.
  • In the second case, 2^23·(floor(log2(x)) + 127) is 2^23 for each binade below that of x, including subnormals and zero. To that, we add the number of values less than or equal to x in the binade of x, which we get by calculating the 24-bit integer significand of x (rounded down) as floor(x / 2^(floor(log2(x)) - 23)) and then subtracting 2^23 - 1 to count the significands from the first normal significand (2^23) to x's significand, inclusive.
  • In the third case, the subnormal values are spaced at intervals of 2^(-126-23) = 2^-149, so we merely count the whole intervals and include the endpoint.

Upvotes: 0

njuffa

Reputation: 26085

This answer assumes that float maps to the binary32 type specified by the IEEE-754 (2008) standard. For normalized binary32 operands, i.e. those in [2^-126, 2^128), there are always exactly 2^23 encodings per binade, since the number of stored significand bits is 23. Determining the number of binary32 encodings in the general case is a bit trickier, for example due to rounding effects: not all powers of ten are exactly representable. It also makes a difference where within a binade the start and end points are located, and we need to account for the subnormals in [0, 2^-126).

But to first order we can estimate that there are roughly as many encodings in [10^30, 10^35] as there are in [10^-2, 10^3]: each interval spans five decades, i.e. 5·log2(10) ≈ 16.6 binades, or roughly 16.6·2^23 ≈ 1.4·10^8 encodings. The interval [0, 10^3] will therefore contain many more binary32 numbers than the interval [10^30, 10^35].

The lazy way to establish the exact count is to brute-force count the number of encodings in a given interval. The C and C++ standard math libraries provide a function nextafterf that increments or decrements a given binary32 operand to its closest neighbor in the direction indicated. So we can simply count how many times we are able to do that for a specified interval. An ISO-C99 program using this approach is shown below. It takes only a few seconds to give us the desired answer on modern hardware:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>

/* count the binary32 numbers in the closed interval [start, stop] */
void countem (float start, float stop)
{
    float x;
    uint32_t count;
    count = 0;
    x = start;
    while (x <= stop) {
        count++;
        x = nextafterf (x, INFINITY);
    }
    printf ("there are %u binary32 numbers in [%15.8e, %15.8e]\n", count, start, stop);
}

int main (void)
{
    countem (0.0f, 1000.0f);
    countem (1e-2f, 1e3f);
    countem (1e30f, 1e35f);
    return EXIT_SUCCESS;
}

This program determines:

there are 1148846081 binary32 numbers in [0.00000000e+000, 1.00000000e+003]
there are 139864311 binary32 numbers in [9.99999978e-003, 1.00000000e+003]
there are 139468867 binary32 numbers in [1.00000002e+030, 1.00000004e+035]

Upvotes: 3
