base12
base12

Reputation: 185

32-bit fixed point arithmetic 1x1 does not equal 1

I'm implementing 32-bit signed integer fixed-point arithmetic. The scale is from 1 to -1, with INT32_MAX corresponding to 1. I'm not sure whether to make INT32_MIN or -INT32_MAX correspond to -1, but that's an aside for now.

I've made some operations to multiply and round, as follows:

#define mul(a, b) ((int64_t)(a) * (b))
#define round(x) (int32_t)((x + (1 << 30)) >> 31)

The product of two numbers can then be found using round(mul(a, b)).

The issue comes up when I check the identity. The main problem is that 1x1 is not 1. It's INT32_MAX-1. That's obviously not desired as I would like bit-accuracy. I suppose this would affect other nearby numbers so the fix isn't a case of just adding 1 if the operands are both INT32_MAX. Additionally, -1x-1 is not -1, 1x-1 is not -1, and -1x-1=-1. So none of the identities hold up.

Is there a simple fix to this, or is this just a symptom of using fixed point arithmetic?

Upvotes: 1

Views: 281

Answers (1)

Eric Postpischil
Eric Postpischil

Reputation: 222660

In its general form, a fixed-point format represents a number x as an integer xs. Commonly, s is a power of some base b, s = bp. For example, we might store a number of dollars x as x•100, so $3.45 might be stored as 345. Here we can easily see why this is called a “fixed-point” format: The stored number conceptually has decimal point inserted as a fixed position, in this case two to the left of the rightmost digit: “345” is conceptually “3.45”. (This may also be called a radix point rather than a decimal point, allowing for cases when the base b is not ten. And p specifies where the radix-point is inserted, p base-b digits from the right.)

If you make INT_MAX represent 1, then you are implicitly saying s = INT_MAX. (And, since INT_MAX is not a power of any other integer, we have b = INT_MAX and p = 1.) Then −1 must be represented by −1•INT_MAX = -INT_MAX. It would not be represented by INT_MIN (except in archaic C implementations where INT_MIN = -INT_MAX).

Given s = INT_MAX, shifting by 31 bits is not a correct way to implementation multiplication. Given two numbers x and y with representations a and b, the representation of xy is computed by multiplying the representations a and b and dividing by s:

  • a represents x, so a = xs.
  • b represents y, so b = ys.
  • Then ab/s = xsys/s = xys, and xys represents xy.

Shifting by 31 divides by 231, so that is not the same as dividing by INT_MAX. Also, division is generally slow in hardware. You may be better off choosing s = 230 instead of INT_MAX. Then you could shift by 30 bits.

When calculating ab/s, we often want to round. Adding ½s to the product before dividing is one method of rounding, but it is likely not what you want for negative products. You may want to consider adding −½s if the product is negative.

Upvotes: 5

Related Questions