user361633

Reputation: 203

C++ floating-point console output issue

float x = 384.951257;
std::cout << std::fixed << std::setprecision(6) << x << std::endl;

The output is 384.951263. Why? I'm using gcc.

Upvotes: 3

Views: 427

Answers (3)

legends2k

Reputation: 32924

All answers here talk as though the issue is due to floating-point numbers and their capacity, but those are just implementation details; the issue is deeper than that. It arises whenever decimal numbers are represented in the binary number system. Even something as simple as 0.1 (base 10) is not exactly representable in binary, because binary can represent exactly only those numbers that are a finite fraction whose denominator is a power of 2. Unfortunately, this excludes most of the numbers that can be written as a finite fraction in base 10, like 0.1.

The single-precision float datatype is usually mapped to what the IEEE 754 standard calls binary32. It has 32 bits, partitioned into 1 sign bit, 8 exponent bits and 23 significand bits (excluding the hidden/implicit bit). Thus we have to calculate up to 24 bits when converting to binary32.

Other answers here evade the actual calculations involved; I'll try to do them. This method is explained in greater detail here. So let's convert the real number into a binary number:

Integer part: 384 (base 10) = 110000000 (base 2), using the usual method of successive division by 2.

Fractional part: 0.951257 (base 10) can be converted by successive multiplication by 2, taking the integer part at each step as the next bit:

0.951257 * 2 = 1.902514

0.902514 * 2 = 1.805028

0.805028 * 2 = 1.610056

0.610056 * 2 = 1.220112

0.220112 * 2 = 0.440224

0.440224 * 2 = 0.880448

0.880448 * 2 = 1.760896

0.760896 * 2 = 1.521792

0.521792 * 2 = 1.043584

0.043584 * 2 = 0.087168

0.087168 * 2 = 0.174336

0.174336 * 2 = 0.348672

0.348672 * 2 = 0.697344

0.697344 * 2 = 1.394688

0.394688 * 2 = 0.789376

Gathering the integer parts obtained above, the fractional part in binary is 0.111100111000010 (base 2). The overall number in binary is 110000000.111100111000010 (base 2); this has 24 bits, as required.

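Converting this back to decimal gives 384 + (15585 / 16384) = 384.9512329... (base 10). That is the truncated value. The next bit of the expansion would be 1 (0.789376 * 2 = 1.578752), so under round to nearest the last kept bit is rounded up by 2^-15, giving 384 + (31171 / 32768) = 384.951263... (base 10), which is what you see.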

This can be verified here.

Upvotes: 2

tillaert

Reputation: 1835

Floats have a limited resolution, so the value gets rounded when you assign it to x.

Upvotes: 2

Cheers and hth. - Alf

Reputation: 145299

float is usually only 32 bits. With about 3 bits per decimal digit (2^10 roughly equals 10^3) that means it can't possibly represent more than about 11 decimal digits, and accounting for other information it also needs to represent, such as magnitude, let's say 6-7 decimal digits. Hey, that's what you got!

Check e.g. Wikipedia for details.

Use double or long double for better precision. double is the default floating-point type in C++: e.g., the literal 3.14 is of type double.

Upvotes: 8
