Vitaly

Reputation: 875

Round to Nearest in IEEE 754

In IEEE 754 there is a "Round to Nearest" mode for rounding floating-point values.

But I do not understand one item in that definition:

If the two nearest representable values are equally near, the one with its least significant bit zero is chosen

What does "the one with its least significant bit zero is chosen" mean?

Upvotes: 3

Views: 5747

Answers (3)

Gaslight Deceive Subvert

Reputation: 20372

It simply means that ties are broken by rounding to even, also known as banker's rounding. For example, 3.5 is rounded to 4.0, but 4.5 is also rounded to 4.0. The same applies to integers that are too large to be represented exactly: in 32-bit floating point, 16777219 is rounded to 16777220.0 and not to 16777218.0, because the latter's significand ends in a one.
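A small C99 sketch of both effects (not from the original answer; it assumes an IEEE 754 platform running in the default round-to-nearest, ties-to-even mode, and uses rint(), which honours the current rounding mode):

#include <math.h>
#include <stdio.h>

int main(void) {
  /* 16777219 lies exactly halfway between the representable singles
     16777218.0f and 16777220.0f; the tie goes to the neighbour whose
     last significand bit is 0, i.e. 16777220.0f. */
  float f = 16777219;
  printf("%.1f\n", f);                           /* 16777220.0 */

  /* rint() rounds to an integer using the current rounding mode,
     which defaults to round-to-nearest, ties-to-even. */
  printf("%.1f %.1f\n", rint(3.5), rint(4.5));   /* 4.0 4.0 */
  return 0;
}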

Upvotes: 1

Vitaly

Reputation: 875

It looks like I have understood the issue. Single- and double-precision numbers are represented as 32- and 64-bit sequences laid out as follows:

b bbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb

b bbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb

Here b is zero or one. The first group is the sign of the number. The second group is the exponent and consists of 8 bits (single precision) or 11 bits (double precision). The third group is the mantissa and consists of 23 bits (single precision) or 52 bits (double precision).

Hence, the least significant bit of a number is the 23rd bit of the mantissa for a single-precision number and the 52nd bit of the mantissa for a double-precision number. This is the rightmost bit of the encoding. When a value lies exactly halfway between two representable numbers, the candidate whose rightmost bit is zero is chosen.
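Here is a small C sketch (not part of the original question) that extracts the three fields, reusing the 16777219 example from the other answer. It assumes float is an IEEE 754 binary32 type, and dump_float is a hypothetical helper introduced only for illustration:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* dump_float (hypothetical helper): print the sign, exponent and mantissa
   fields of an IEEE 754 single-precision value. */
static void dump_float(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);            /* reinterpret the 32 bits      */
  printf("%.1f = sign %u, exponent %u, mantissa 0x%06x, last bit %u\n",
         f,
         (unsigned)(bits >> 31),             /* 1 sign bit                   */
         (unsigned)((bits >> 23) & 0xFF),    /* 8 exponent bits              */
         (unsigned)(bits & 0x7FFFFF),        /* 23 mantissa bits             */
         (unsigned)(bits & 1));              /* the least significant bit    */
}

int main(void) {
  dump_float(16777218.0f);      /* mantissa ends in 1 (the "odd" neighbour)  */
  dump_float(16777220.0f);      /* mantissa ends in 0 (the "even" neighbour) */
  dump_float((float)16777219);  /* the halfway value rounds to the even one  */
  return 0;
}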

Note: evenness and oddness are defined only for integers. Hence, if the rounding function produces only integer results, this rule reduces to the familiar round-half-to-even rule.

Thanks to everyone for your efforts.

Upvotes: 2

Pascal Cuoq

Reputation: 80276

The best way to play with the round-to-even rule is to round double-precision numbers written in hexadecimal to single-precision, for instance in the C99 or Java programming languages.

Single precision has 23 explicit significand bits, so the numbers 0x1.000000p0, 0x1.000002p0, 0x1.000004p0, … are single-precision numbers, but the numbers in between are not.

When a value is exactly in between two consecutive single-precision floating-point numbers l and u, the binary expansions of l and u differ in the 23rd bit after the dot in the notation 1.bbbbbbbbbbbbbbbbbbbbbbb * 2^exp. This is a simple consequence of l and u being consecutive.

The double-precision numbers 0x1.000001p0, 0x1.000003p0, 0x1.000005p0, … are exactly in between two single-precision numbers and need to be rounded according to the "least significant bit zero" rule.

Example C99 program:

#include <stdio.h>

int main(void) {
  /* 0x1.000001p0 is exactly halfway between the consecutive
     single-precision numbers 0x1.000000p0 and 0x1.000002p0. */
  double d = 0x1.000001p0;
  for (int i = 0; i < 10; i++) {
    printf("double-precision:%.6a\n"
           "single-precision:%.6a\n\n",
           d, (float) d);
    /* Step by one single-precision ulp (2^-23) to land exactly on
       the next halfway point. */
    d += 0x0.000002p0;
  }
}

Results illustrating how rounding goes to the single-precision value with a 0 as the 23rd binary digit after the dot:

double-precision:0x1.000001p+0
single-precision:0x1.000000p+0

double-precision:0x1.000003p+0
single-precision:0x1.000004p+0

double-precision:0x1.000005p+0
single-precision:0x1.000004p+0

double-precision:0x1.000007p+0
single-precision:0x1.000008p+0

double-precision:0x1.000009p+0
single-precision:0x1.000008p+0

double-precision:0x1.00000bp+0
single-precision:0x1.00000cp+0

double-precision:0x1.00000dp+0
single-precision:0x1.00000cp+0

double-precision:0x1.00000fp+0
single-precision:0x1.000010p+0

double-precision:0x1.000011p+0
single-precision:0x1.000010p+0

double-precision:0x1.000013p+0
single-precision:0x1.000014p+0

Upvotes: 1
