DuckSauce

Reputation: 37

Calculating range of float in C

The title pretty much sums it up.

I know that a float is 32 bits in total: 23 bits for the mantissa, 8 bits for the exponent, and 1 bit for the sign.

Calculating the range of "int" is pretty simple: 32 bits minus 1 sign bit leaves 31 bits, so the range extends to 2³¹ ≈ 2.147e9 (the largest value is 2³¹ − 1).

The formula makes sense...

Now I've looked around Stack Overflow, but all the answers I've found regarding float range calculations lacked substance: just a bunch of numbers appearing out of nowhere in the responses and magically reaching the 3.4e38 conclusion.

I'm looking for an answer from someone with real knowledge of the subject, someone who can explain with a formula how this range is calculated.

Thank you all.

Mo.

Upvotes: 2

Views: 7800

Answers (2)

chux

Reputation: 153456

C does not define float as described by OP. The format OP describes, binary32, is the most popular, but it is only one of many conforming formats.

What C does define

5.2.4.2.2 Characteristics of floating types

s sign (±1)
b base or radix of exponent representation (an integer > 1)
e exponent (an integer between a minimum emin and a maximum emax)
p precision (the number of base-b digits in the significand)
f_k nonnegative integers less than b (the significand digits)

x = s*power(b,e)*Σ(k=1..p, f_k*power(b,-k))

For binary32, the max value is

x = (+1)*power(2, 128)*(0.1111111111 1111111111 1111 binary)

x = 3.402...e+38

Given 32 bits to define a float, many other possibilities exist. For example, a float could be laid out just like binary32 yet not support infinity or not-a-number. That frees up one more exponent value for finite numbers, and the max value is then 2*3.402...e+38.


binary32 describes its significand as ranging up to 1.11111... binary, whereas the C characteristic formula above ranges up to 0.111111... The exponent is offset by one to compensate, which is why the maximum above uses power(2, 128) rather than power(2, 127).

Upvotes: 1

lanceg

Reputation: 76

On most platforms, C's float uses the IEEE 754 single-precision format, which means that a 32-bit float has 1 bit for the sign, 8 bits for the exponent, and 23 bits for the mantissa. The mantissa is calculated by summing each mantissa bit * 2^(-(bit_index)), where bit_index counts from 1 at the most significant mantissa bit. The exponent is calculated by converting the 8-bit binary number to decimal and subtracting 127 (thus you can have negative exponents as well), and the sign bit indicates whether or not the value is negative. The formula for normal numbers is thus:

(-1)^S * 1.M * 2^(E - 127)

Where S is the sign, M is the mantissa, and E is the exponent. See https://en.wikipedia.org/wiki/Single-precision_floating-point_format for a better mathematical explanation.

To explicitly answer your question: the exponent field E = 255 is reserved for infinity and NaN, so the largest finite value uses E = 254 with all mantissa bits set. For a 32-bit float that gives (-1)^0 * 1.99999988079071044921875 * 2^127, which is approximately 3.4028234664 × 10^38. The most negative finite value is the negative of that.

Upvotes: 0
