user200783

Reputation: 14348

Can "floor division" of floating-point numbers (e.g. in Python) cause inaccuracy?

Guido van Rossum has written a blog post explaining why, in Python, integer division (for example, a // b) is "floor division" - the quotient is rounded towards negative infinity. Correspondingly, the sign of a % b matches the sign of b.

This differs from C, where the quotient is rounded towards zero and the result of a % b has the sign of a.

Python also uses floor division, and the corresponding "sign-of-modulo matches sign-of-divisor", for floating-point numbers. The blog post claims that this can be inaccurate in certain cases (where C's "sign-of-modulo matches sign-of-dividend" would be accurate). Is this true? Are there any concrete examples?

Upvotes: 4

Views: 118

Answers (1)

Eric Postpischil

Reputation: 222828

Introduction

The following proof is longer than I want it to be, but this question has gone several days without being answered, and it deserves an answer.

Before going into the proof, let me address this intuitively. If we define modulo to return a result for x % y that is in [−y/2, +y/2] (for positive y), then the result is always either x or is reduced by adding (positive or negative) multiples of y. If the result is x, it is representable since x is given in a representable form. If the result is reduced, then it is necessarily a multiple of the position value of the low digit in y, and its greatest digit position is no greater than the greatest digit position in y, and hence it fits in the floating-point format and is representable.

On the other hand, if we define modulo to return a result for x % y that is in [0, y), then a small negative x must be increased by adding y. When x is small, it may have digits in lower positions than y, and, when it does, the result of adding y must have a non-zero digit in the lowest position that x does, but it must also have a non-zero digit in a higher position than the small x does (because y is adding a digit in a higher position). Therefore, the result needs more digits than fit in the floating-point format, and the result is not representable.

A simple example is −2^−60 % 1. The mathematical result is 1 − 2^−60, but this cannot be represented with just 53 bits in the significand; it needs bits with position values from 2^−1 to 2^−60, which requires 60 bits.
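This example is easy to reproduce in Python (a sketch; the exact behavior assumes IEEE-754 double precision, which CPython floats use on essentially all platforms):

```python
import math

x = -2.0**-60   # a tiny negative value
y = 1.0

# Floor-style modulo: the mathematical result 1 - 2**-60 needs 60
# significand bits, so it is rounded -- here all the way up to 1.0,
# which even violates the usual invariant 0 <= x % y < y.
print(x % y)            # 1.0, not the exact 1 - 2**-60

# C-style (truncating) modulo: the result keeps the sign of x and is
# exactly representable, so no rounding occurs.
print(math.fmod(x, y))  # -2**-60 exactly (about -8.67e-19)
```

Note that `x % y` here not only rounds but returns a value equal to the divisor, something the exact floor modulo could never do.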

Symmetric Modulo Is Exact

First, let’s see that a symmetric modulo, defined so that x % y is in [−y/2, +y/2] for positive y, always has a representable result. I will also assume x is positive, but the arguments for negative x and/or negative y are symmetric, and the results for x = 0 are trivial.

x % y is defined to be r such that r = x − q·y for some integer q, and typically we define some constraints on r or q so that r is uniquely determined (or perhaps at least usually uniquely determined, with some flexibility when the result is at an endpoint of some interval). Since q is an integer, if both x and y are integer multiples of some number g (which might or might not be an integer), then r is also an integer multiple of g.

In a floating-point format, a number is represented using a sign, a base b (which is an integer greater than 1), a fixed number p of base-b digits, and an exponent e. The number represented is ± digits × b^e. Let’s write the individual digits as d_{p−1} d_{p−2} d_{p−3} … d_2 d_1 d_0.

Consider the inputs x and y. Using x_i to denote the base-b digits used in representing x, and e_x for the exponent used in representing x, and similarly for y, we have x = x_{p−1}…x_0 × b^(e_x) and y = y_{p−1}…y_0 × b^(e_y).

Observe that both x and y are multiples of the lesser of b^(e_x) and b^(e_y), and so r must be too.

If b^(e_y) ≤ b^(e_x), then r is a multiple of b^(e_y). Also, |r| is necessarily less than y. This implies we can represent r as ± r_{p−1}…r_0 × b^(e_y): r is small enough that these digits with the exponent e_y are large enough to represent its value, and, because it is a multiple of b^(e_y), it does not need any digits with a smaller exponent. Thus, r is representable in the floating-point format.

Now consider b^(e_x) < b^(e_y). Also suppose that y is normalized, by which we mean that its leading digit, y_{p−1}, is not zero. (If it is zero, find a normalized representation of y by decreasing its exponent to shift a non-zero digit into the leading position. Then the above paragraph applies. If y has no non-zero digits, it is zero, and x % y is not defined.) Then x < y. In this case, r is either x or x − y, because one of these two is in [−y/2, +y/2]. If r is x, then it is representable since x is representable. If r is x − y, then x ≥ y/2, and |r| ≤ x. Since r is a multiple of b^(e_x) and |r| ≤ x, we must be able to represent r as ± r_{p−1}…r_0 × b^(e_x).
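Python exposes such a symmetric modulo as math.remainder (the IEEE-754 remainder operation, whose result lies in [−y/2, +y/2]). The proof's claim can be spot-checked against exact rational arithmetic; this is a sketch, and the test values are arbitrary:

```python
import math
from fractions import Fraction

def is_exact_remainder(x, y):
    """Check that math.remainder(x, y) equals the exact mathematical
    result computed in rational arithmetic (no rounding at all)."""
    r = math.remainder(x, y)
    fx, fy = Fraction(x), Fraction(y)
    q = round(fx / fy)          # nearest integer quotient, ties to even,
                                # matching the IEEE-754 definition
    return Fraction(r) == fx - q * fy

# Symmetric modulo is exact even in the case that breaks floor modulo:
print(is_exact_remainder(-2.0**-60, 1.0))   # True
print(is_exact_remainder(0.1, 0.03))        # True
```

Python's round and Fraction are exact here because round applied to a Fraction rounds half to even, the same tie-breaking rule IEEE 754 specifies for the remainder's quotient.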

Asymmetric Modulo May Be Inexact

The above proof tells us that symmetric modulo is exact because the result is always either the unchanged x or x reduced in magnitude sufficiently that all the required digits fit in the floating-point format. And this tells us how to break modulo defined such that x % y is in [0, y): Select an x that must be increased in magnitude.

We have y = y_{p−1}…y_0 × b^(e_y). Let y be normalized, as described above. For x, select any value that is negative, has e_x < e_y, and is not a multiple of b^(e_y) (meaning that at least one of its digits from x_{e_y−1−e_x} to x_0 is not zero). In some cases where the leading digit of y is 1 and a borrow from it occurs, the result may be representable. Otherwise, the greatest digit position the result needs is the same as y’s greatest digit position, and it needs digits below b^(e_y), and therefore it is not representable.
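This recipe is easy to check mechanically: compute the exact floor-modulo result in rational arithmetic and test whether the float result actually equals it. A sketch using the standard library's Fraction, which implements floor modulo exactly:

```python
from fractions import Fraction

def floor_mod_is_exact(x, y):
    """True if Python's float x % y equals the exact mathematical result."""
    exact = Fraction(x) % Fraction(y)   # floor modulo in exact rationals
    return Fraction(x % y) == exact

# x negative, with a smaller exponent than y and low digits that are
# not a multiple of y's lowest digit position:
print(floor_mod_is_exact(-2.0**-60, 1.0))   # False: result was rounded
# The same magnitude with a positive sign needs no adjustment by y:
print(floor_mod_is_exact(2.0**-60, 1.0))    # True: result is x itself
```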

Upvotes: 2
