Pooria

Reputation: 2055

Is it possible to omit rounding of intermediate results during arithmetic operations on multiple FP operands?

Is it possible to perform an arithmetic operation on multiple floating-point operands without rounding the intermediate results, rounding only the final result, and are there any architectures that currently do this? As far as I've seen, after two floating-point operands are used in an addition/subtraction, the result is rounded before it is used as an operand in the next operation. I have also seen this.

EDIT:

Below are some examples, using the single-precision format, to illustrate the concept. The 3 least significant bits of the intermediate 27-bit mantissas taking part in the arithmetic are the guard, round and sticky bits. The examples show that, with the intermediate mantissa layout used in an IEEE 754-compliant FP system, rounding of intermediate values can be avoided, and doing so gives a more accurate final result (a short C sketch after the examples shows the same effect numerically):

1_Example 1

A-B = 101001000010001110110100 1 0 1×2^EXP

C   = 100011001011001010010100×2^(EXP-4) --> 
C   = 000010001100101100101001 0 1 0×2^EXP

1_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B   = 101001000010001110110101×2^EXP
        A-B-C = 100110110101100010001011 1 1 0×2^EXP
Rounded A-B-C = 100110110101100010001100×2^EXP

1_2_If A-B-C is calculated without A-B being rounded:

        A-B-C = 100110110101100010001011 0 1 1×2^EXP
Rounded A-B-C = 100110110101100010001011×2^EXP

2_Example 2

A-B = 100001001100101011100000 1 0 1×2^EXP
C   = 101001011010001110001000×2^(EXP-5) --> 
C   = 000001010010110100011100 0 1 0×2^EXP

2_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B   = 100001001100101011100001×2^EXP
A-B-C         = 011111111001110111000100 1 1 0×2^EXP
A-B-C shifted 1 bit to the left
              = 111111110011101110001001 1 0 0×2^(EXP-1)
Rounded A-B-C = 11111111001110111000101×2^(EXP-1)

2_2_If A-B-C is calculated without A-B being rounded:

A-B-C         = 011111111001110111000100 0 1 1×2^EXP
A-B-C shifted 1 bit to the left
              = 111111110011101110001000 1 1 0×2^(EXP-1)
Rounded A-B-C = 111111110011101110001001×2^(EXP-1)

3_Example 3

A-B = 10000001101000110101001 1 1 1×2^EXP

C   = 100010100101011010010000×2^(EXP-6) -->
C   = 000000100010100101011010 0 1 0×2^EXP

3_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 10000001101000110101010×2^EXP
   A-B-C    = 01111101010100001001111 1 1 0×2^EXP
A-B-C shifted 1 bit to the left
            = 11111010101000010011111 1 0 0×2^(EXP-1)
Rounded A-B-C=11111010101000010100000×2^(EXP-1)

3_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 01111101010100001001111 1 0 1×2^EXP
A-B-C shifted 1 bit to the left
            = 11111010101000010011111 0 1 0×2^(EXP-1)
Rounded A-B-C=11111010101000010011111×2^(EXP-1)

4_Example 4

A-B = 101100101000111000110101 0 1 1×2^EXP

C = 100110010110011101100000×2^(EXP-7) --> 
C = 000000010011001011001110 1 1 0×2^EXP

4_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 101100101000111000110110×2^EXP
A-B-C       = 101100010101101101100111 0 1 0×2^EXP
Rounded A-B-C=101100010101101101100111×2^EXP

4_2_If A-B-C is calculated without A-B being rounded:

A-B-C         = 101100010101101101100111 1 0 1×2^EXP
Rounded A-B-C = 101100010101101101101000×2^EXP

5_Example 5

A-B = 100000111011001111001010 0 1 1×2^EXP

C = 110001011010010110010110×2^(EXP-3) --> 
C = 000110001011010010110010 1 1 0×2^EXP

5_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 100000111011001111001010×2^EXP
A-B-C       = 011010101111111100010111 0 1 0×2^EXP
A-B-C shifted 1 bit to the left
            = 110101011111111000101110 1 0 0×2^(EXP-1)
Rounded A-B-C=110101011111111000101110×2^(EXP-1)

5_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 011010101111111100010111 1 0 1×2^EXP
A-B-C shifted 1 bit to the left
            = 110101011111111000101111 0 1 0×2^(EXP-1)
Rounded A-B-C=110101011111111000101111×2^(EXP-1)

6_Example 6

A-B = 100000000011000111001010 0 0 1×2^EXP
C = 110010001100110111000000×2^(EXP-8) -->
C = 000000001100100011001101 1 1 0×2^EXP

6_1_If A-B-C is calculated after A-B is rounded:

Rounded A-B = 100000000011000111001010×2^EXP
A-B-C       = 011111110110100011111100 0 1 0×2^EXP
A-B-C shifted 1 bit to the left
            = 111111101101000111111000 1 0 0×2^(EXP-1)
Rounded A-B-C=111111101101000111111000×2^(EXP-1)

6_2_If A-B-C is calculated without A-B being rounded:

A-B-C       = 011111110110100011111100 0 1 1×2^EXP
A-B-C shifted 1 bit to the left
            = 111111101101000111111000 1 1 0×2^(EXP-1)
Rounded A-B-C=111111101101000111111001×2^(EXP-1)
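
Here is the C sketch referred to above. It is only my own illustration (not derived from the bit patterns in the examples), and it assumes FLT_EVAL_METHOD == 0, i.e. float expressions are not silently evaluated in a wider format. The operands are chosen so that rounding the intermediate sum a+b throws away exactly the sticky-bit information that should have decided the final rounding:

#include <stdio.h>

int main(void)
{
    float a = 1.0f;          /* 1.0                                        */
    float b = 0x1.0p-24f;    /* 2^-24: exactly half an ulp of 1.0f         */
    float c = 0x1.0p-48f;    /* 2^-48: only visible through the sticky bit */

    /* Each float addition rounds: a+b is a tie and rounds down to 1.0f
       (ties-to-even), so c can no longer influence the result.           */
    float rounded_each_step = (a + b) + c;

    /* double holds a+b+c exactly here, so converting back to float
       rounds only once and c tips the tie upward to 1.0f + 2^-23.        */
    float rounded_once = (float)((double)a + (double)b + (double)c);

    printf("rounded after every step: %.9e\n", rounded_each_step);  /* 1.000000000e+00 */
    printf("rounded only at the end : %.9e\n", rounded_once);       /* 1.000000119e+00 */
    return 0;
}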

Upvotes: 1

Views: 207

Answers (3)

Patricia Shanahan

Reputation: 26185

According to the paper referenced in the question, it is possible to calculate the dot product of a pair of length-N vectors with a single rounding operation at the end, obtaining the representable value closest to the exact dot product.

In practice, current computers round intermediate results, which often yields an answer that is not the closest representable value. At best, the rounding is done in an extended format, which tends to reduce, but does not eliminate, the intermediate rounding error.

With a fused multiply-add the rounding is done N times, once for each multiply-add. Without a fused multiply-add it is done twice for each pair, once after the multiply and again after the add.
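
For illustration only (this shows the rounding counts described above, not the correctly rounded architecture from the paper), here is a small C sketch using the standard fma() from <math.h>. Compile with contraction disabled (e.g. gcc -ffp-contract=off demo.c -lm) so the first loop really does perform two roundings per term:

#include <math.h>
#include <stdio.h>

/* Without FMA: two roundings per term (after the multiply, then after
   the add), i.e. 2N roundings for length-N vectors.                   */
static double dot_mul_add(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* With FMA: x[i]*y[i] + acc is rounded once, i.e. N roundings total.
   Still N roundings, not the single final rounding the paper targets. */
static double dot_fma(const double *x, const double *y, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(x[i], y[i], acc);
    return acc;
}

int main(void)
{
    /* The exact dot product is 2^-54; the extra roundings in dot_mul_add lose it. */
    double x[] = {-1.0,             1.0 + 0x1.0p-27};
    double y[] = { 1.0 + 0x1.0p-26, 1.0 + 0x1.0p-27};

    printf("mul then add: %.17g\n", dot_mul_add(x, y, 2));  /* 0                      */
    printf("fma         : %.17g\n", dot_fma(x, y, 2));      /* 5.5511151231257827e-17 */
    return 0;
}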

The BibTeX entry for the paper is:

@INPROCEEDINGS{ yao:correctly,
AUTHOR = "Tao Yao and Deyuan Gao and Xiaoya Fan and Jari Nurmi",
TITLE = "Correctly rounded architectures for Floating-Point multi-operand addition and dot-product computation",
BOOKTITLE = "ASAP'13",
PAGES = {346-355},
YEAR = {2013}, 
}

Upvotes: 3

Stefan

Reputation: 4705

Another approach is to use interval arithmetic to control and bound the possible errors introduced by computer arithmetic. After the computation you get an interval that is guaranteed to contain the true result. There are several implementations available; the Wikipedia page on interval arithmetic has a section listing some libraries. In the late 1980s I used Pascal-SC, mentioned there, while working as a programming tutor at Karlsruhe University :-)
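
As a rough illustration of the idea, here is a hand-rolled sketch in C using directed rounding (this is not Pascal-SC or any of the listed libraries, and it needs a compiler that honours fesetround, e.g. GCC with -frounding-math and without -ffast-math):

#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

typedef struct { double lo, hi; } interval;

/* Lower bound computed rounding down, upper bound rounding up, so the
   exact real-number result always stays inside [lo, hi].              */
static interval iv_add(interval a, interval b)
{
    interval r;
    fesetround(FE_DOWNWARD);
    r.lo = a.lo + b.lo;
    fesetround(FE_UPWARD);
    r.hi = a.hi + b.hi;
    fesetround(FE_TONEAREST);   /* restore the default rounding mode */
    return r;
}

int main(void)
{
    interval x   = {0.1, 0.1};  /* lo == hi == the double nearest to 0.1 */
    interval sum = {0.0, 0.0};

    for (int i = 0; i < 10; i++)
        sum = iv_add(sum, x);

    /* The exact sum of the ten stored doubles is guaranteed to lie in
       [lo, hi]; plain round-to-nearest summation gives a single value
       with no such guarantee.                                          */
    printf("ten times 0.1 lies in [%.17g, %.17g]\n", sum.lo, sum.hi);
    return 0;
}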

Upvotes: 1

supercat

Reputation: 81169

Some architectures will evaluate the intermediate results of an expression using an 80-bit "extended precision" floating-point type, even if those results will then be stored into a 64-bit or 32-bit type. Unfortunately, today's programming languages often offer no control over that beyond the choice of whether to "allow" or "forbid" it; they offer no way to request such behavior in the places where it would be helpful. It would in general be prohibitively expensive for languages to try to perform floating-point math with anything beyond an 80-bit extended-precision type.
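
For example (just a sketch; whether long double is actually the 80-bit extended format depends on the platform: it usually is on x86 Linux toolchains, while MSVC maps it to plain double), intermediates can be kept wider explicitly and rounded back to double only once at the end:

#include <float.h>
#include <stdio.h>

/* Every addition is rounded to 53 bits before the next one. */
static double sum_double(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Intermediates kept in long double (64-bit mantissa on x87-style
   implementations); only the final conversion rounds to double.    */
static double sum_extended(const double *x, int n)
{
    long double s = 0.0L;
    for (int i = 0; i < n; i++)
        s += x[i];
    return (double)s;
}

int main(void)
{
    /* The small addends are largely absorbed by 1e16 when every
       intermediate sum is rounded to double.                     */
    double x[] = {1e16, 3.14159, -1e16, 2.71828};

    printf("long double mantissa bits: %d\n", LDBL_MANT_DIG);
    printf("rounded every step : %.17g\n", sum_double(x, 4));
    printf("wider intermediates: %.17g\n", sum_extended(x, 4));
    return 0;
}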

Upvotes: 2
