Reputation: 221
Given a real value, can we check whether a float data type is enough to store the number, or whether a double is required?
I know precision varies from architecture to architecture. Is there any C/C++ function to determine the right data type?
Upvotes: 10
Views: 25042
Reputation: 137940
Precision is not very platform-dependent. Although platforms are allowed to differ, float is almost universally IEEE 754 single precision and double is IEEE 754 double precision.
Single precision assigns 23 bits to the "mantissa," the binary digits after the radix point (the binary counterpart of the decimal point). Since the bit before the point is always one for normalized values, this equates to a 24-bit significand. Dividing by log2(10) ≈ 3.32, a float gets you about 7.2 decimal digits of precision.
Following the same process, double (53 significand bits) yields about 15.95 digits and long double (64 significand bits) yields about 19.27, for systems using the Intel 80-bit format.
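The digit counts above can be reproduced from `std::numeric_limits`. This is only a sketch: `decimal_digits` is a helper name I made up, and it assumes a binary (radix 2) floating-point type.

```cpp
#include <cmath>
#include <limits>

// Approximate decimal digits carried by T: significand bits / log2(10).
// Hypothetical helper; assumes T is a binary floating-point type.
template <typename T>
double decimal_digits() {
    static_assert(std::numeric_limits<T>::radix == 2,
                  "binary floating point expected");
    // digits counts the implicit leading bit, so it is 24 for IEEE single.
    return std::numeric_limits<T>::digits / std::log2(10.0);
}
```

On an IEEE 754 platform, `decimal_digits<float>()` comes out near 7.22 and `decimal_digits<double>()` near 15.95. The `long double` result depends on the compiler and target.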
The bits besides the mantissa, apart from the one sign bit, are used for the exponent. The number of exponent bits determines the range of numbers allowed. Single goes to ~10^±38, double goes to ~10^±308.
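If you want to check the range side programmatically, something like the following might do. Both function names are hypothetical, and the magnitude check deliberately treats subnormal values as out of range, which is a conservative assumption on my part.

```cpp
#include <cmath>
#include <limits>

// Largest power of ten that T can hold as a finite value (38 for IEEE single).
template <typename T>
constexpr int max_decimal_exponent() {
    return std::numeric_limits<T>::max_exponent10;
}

// True if |x| is zero or lies within the normalized range of T.
// Subnormal magnitudes are reported as out of range (assumption).
template <typename T>
bool magnitude_in_range(double x) {
    const double a = std::fabs(x);
    return a == 0.0 ||
           (a >= static_cast<double>(std::numeric_limits<T>::min()) &&
            a <= static_cast<double>(std::numeric_limits<T>::max()));
}
```

Note that this says nothing about precision: a value can be in range for float and still lose digits when stored in one.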
As for whether you need 7, 16, or 19 digits or if limited-precision representation is appropriate at all, that's really outside the scope of the question. It depends on the algorithm and the application.
Upvotes: 3
Reputation: 47377
I think your question presupposes a way to specify any "real number" to C / C++ (or any other program) without precision loss.
Suppose that you get this real number by specifying it in code or through user input; a way to check if a float or a double would be enough to store it without precision loss is to just count the number of significant bits and check that count, along with the exponent, against what float and double can hold.
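A rough sketch of the bit-counting idea, assuming the value has already been parsed into a finite double (so anything needing more than 53 bits is already lost at that point). `significant_bits` and `fits_float_by_bit_count` are names I invented, and float subnormals are treated as not fitting.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Number of bits from the leading 1 to the last 1 in the binary expansion.
int significant_bits(double x) {
    assert(std::isfinite(x));
    if (x == 0.0) return 0;
    int exponent = 0;
    const double frac = std::frexp(std::fabs(x), &exponent);  // frac in [0.5, 1)
    // Scale the fraction to a 53-bit integer, then drop trailing zero bits.
    std::uint64_t bits = static_cast<std::uint64_t>(std::ldexp(frac, 53));
    int trailing_zeros = 0;
    while ((bits & 1u) == 0) {
        bits >>= 1;
        ++trailing_zeros;
    }
    return 53 - trailing_zeros;
}

// True if x needs no more significand bits than float has and its binary
// exponent lies in float's normalized range.
bool fits_float_by_bit_count(double x) {
    if (x == 0.0) return true;
    int exponent = 0;
    std::frexp(x, &exponent);  // same exponent convention as numeric_limits
    return significant_bits(x) <= std::numeric_limits<float>::digits &&
           exponent >= std::numeric_limits<float>::min_exponent &&
           exponent <= std::numeric_limits<float>::max_exponent;
}
```

For example, 16777216.0 (2^24) needs one significant bit and fits, while 16777217.0 needs 25 bits and does not.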
If the number is given as an expression (e.g. 1/7 or sqrt(2)), you will also want ways of detecting:
Moreover, there are numbers, such as 0.9, that float / double cannot in theory represent "exactly" (at least not in our binary computation paradigm) - see Jon Skeet's excellent answer on this.
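To make the 0.9 case concrete, here is a small sketch that measures how far apart the float and double approximations of 0.9 are. The function name is hypothetical. Neither value is exactly 0.9; the gap just shows that they are two different approximations.

```cpp
#include <cmath>

// Distance between the float nearest to 0.9 and the double nearest to 0.9.
double float_vs_double_gap_for_0_9() {
    const float as_float = 0.9f;
    const double as_double = 0.9;
    // Widening the float to double is exact, so the difference is the
    // extra rounding error that float introduces.
    return std::fabs(static_cast<double>(as_float) - as_double);
}
```

On an IEEE 754 platform the gap is nonzero and on the order of 1e-8, which matches the roughly 7 decimal digits float offers.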
Lastly, see additional discussion on float vs. double.
Upvotes: 3
Reputation: 26185
For background, see What Every Computer Scientist Should Know About Floating-Point Arithmetic
Unfortunately, I don't think there is any way to automate the decision.
Generally, when people represent numbers in floating point, rather than as strings, the intent is to do arithmetic using the numbers. Even if all the inputs fit in a given floating point type with acceptable precision, you still have to consider rounding error and intermediate results.
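A quick way to see the rounding-error point is to repeat the same addition in both types. This is just an illustration, with `repeated_sum` being a made-up helper:

```cpp
#include <cassert>

// Adds `step` to a running total `count` times, rounding to T at every step.
template <typename T>
T repeated_sum(T step, int count) {
    assert(count >= 0);
    T total = 0;
    for (int i = 0; i < count; ++i) {
        total += step;  // each addition rounds to T's precision
    }
    return total;
}
```

Calling `repeated_sum<float>(0.1f, 1000000)` and `repeated_sum<double>(0.1, 1000000)` on a typical IEEE 754 platform, the double total stays very close to 100000 while the float total drifts visibly away from it, even though 0.1 itself is stored with "acceptable" precision in both types.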
In practice, most calculations will work with enough precision for usable results, using a 64 bit type. Many calculations will not get usable results using only 32 bits.
In modern processors, buses and arithmetic units are wide enough to give 32 bit and 64 bit floating point similar performance. The main motivation for using 32 bit is to save space when storing a very large array.
That leads to the following strategy:
If arrays are large enough to justify spending significant effort to halve their size, do analysis and experiments to decide whether a 32 bit type gives good enough results, and if so use it. Otherwise, use a 64 bit type.
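One possible first experiment for the "is 32 bit good enough" question is to measure the worst relative error from simply storing your data as float. The names and the tolerance parameter below are my own assumptions, and this only covers storage error, not error accumulated by later arithmetic.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Largest relative error introduced by narrowing each value to float.
// Assumes every value is finite and within float's range; zeros are skipped.
double max_relative_storage_error(const std::vector<double>& values) {
    double worst = 0.0;
    for (double v : values) {
        if (v == 0.0) continue;
        assert(std::isfinite(v) &&
               std::fabs(v) <= std::numeric_limits<float>::max());
        const float narrowed = static_cast<float>(v);
        const double err =
            std::fabs(static_cast<double>(narrowed) - v) / std::fabs(v);
        if (err > worst) worst = err;
    }
    return worst;
}

// True if storing the data as float stays within the caller's tolerance.
bool float_storage_ok(const std::vector<double>& values, double tolerance) {
    return max_relative_storage_error(values) <= tolerance;
}
```

If this check already fails for your tolerance, there is no point going further with float. If it passes, you still need to run the actual computation in both types and compare results.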
Upvotes: 4
Reputation: 2894
A very detailed post that may or may not answer your question.
An entire series in floating point complexities!
Upvotes: 1
Reputation:
You cannot represent arbitrary real numbers with float or double variables, only a finite subset of the rational numbers.
When you do floating point computation, your CPU floating point unit will decide the best approximation for you.
I might be wrong, but I thought that the float (4 bytes) and double (8 bytes) floating point representations were actually specified independently of computer architectures.
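The language standard does not guarantee those sizes, but you can check what your compiler actually provides. A small sketch, with function names of my own choosing:

```cpp
#include <limits>

// True if float looks like IEEE 754 single precision on this platform.
constexpr bool float_is_ieee_single() {
    return std::numeric_limits<float>::is_iec559 &&
           sizeof(float) == 4 &&
           std::numeric_limits<float>::digits == 24;
}

// True if double looks like IEEE 754 double precision on this platform.
constexpr bool double_is_ieee_double() {
    return std::numeric_limits<double>::is_iec559 &&
           sizeof(double) == 8 &&
           std::numeric_limits<double>::digits == 53;
}
```

Since both are `constexpr`, they could also be used in a `static_assert` so that code relying on the IEEE layout fails to compile elsewhere.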
Upvotes: 0
Reputation: 1891
Couldn't you simply store it in a float and a double variable and then compare the two? The comparison should implicitly convert the float back to a double; if there is no difference, the float is sufficient.
float f = value;
double d = value;
if ((double)f == d)
{
    // float is sufficient
}
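A slightly more defensive version of the same round-trip idea, as a sketch. The function name is hypothetical. The range check is there because, as far as I know, converting a double that is outside float's range is undefined behavior in C++, even though IEEE hardware usually just produces infinity.

```cpp
#include <cmath>
#include <limits>

// True if `value` survives a round trip through float unchanged.
bool float_is_sufficient(double value) {
    // NaN and infinities exist in both types, so treat them as representable.
    if (!std::isfinite(value)) return true;
    // Reject out-of-range values before converting.
    if (std::fabs(value) > std::numeric_limits<float>::max()) return false;
    const float narrowed = static_cast<float>(value);
    return static_cast<double>(narrowed) == value;
}
```

Keep in mind this tests exact representability only: `float_is_sufficient(0.1)` is false, even though float may be perfectly adequate for your application's tolerance.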
Upvotes: 0