Reputation: 185
I'm implementing 32-bit signed integer fixed-point arithmetic. The scale is from 1 to -1, with INT32_MAX
corresponding to 1. I'm not sure whether to make INT32_MIN
or -INT32_MAX
correspond to -1, but that's an aside for now.
I've made some operations to multiply and round, as follows:
#define mul(a, b) ((int64_t)(a) * (b))
#define round(x) (int32_t)((x + (1 << 30)) >> 31)
The product of two numbers can then be found using round(mul(a, b))
.
The issue comes up when I check the identity.
The main problem is that 1x1 is not 1. It's INT32_MAX-1
. That's obviously not desired as I would like bit-accuracy. I suppose this would affect other nearby numbers so the fix isn't a case of just adding 1 if the operands are both INT32_MAX
.
Additionally, -1x-1 is not -1, 1x-1 is not -1, and -1x-1=-1. So none of the identities hold up.
Is there a simple fix to this, or is this just a symptom of using fixed point arithmetic?
Upvotes: 1
Views: 281
Reputation: 222660
In its general form, a fixed-point format represents a number x as an integer x•s. Commonly, s is a power of some base b, s = bp. For example, we might store a number of dollars x as x•100, so $3.45 might be stored as 345. Here we can easily see why this is called a “fixed-point” format: The stored number conceptually has decimal point inserted as a fixed position, in this case two to the left of the rightmost digit: “345” is conceptually “3.45”. (This may also be called a radix point rather than a decimal point, allowing for cases when the base b is not ten. And p specifies where the radix-point is inserted, p base-b digits from the right.)
If you make INT_MAX
represent 1, then you are implicitly saying s = INT_MAX
. (And, since INT_MAX
is not a power of any other integer, we have b = INT_MAX
and p = 1.) Then −1 must be represented by −1•INT_MAX
= -INT_MAX
. It would not be represented by INT_MIN
(except in archaic C implementations where INT_MIN
= -INT_MAX
).
Given s = INT_MAX
, shifting by 31 bits is not a correct way to implementation multiplication. Given two numbers x and y with representations a and b, the representation of xy is computed by multiplying the representations a and b and dividing by s:
Shifting by 31 divides by 231, so that is not the same as dividing by INT_MAX
. Also, division is generally slow in hardware. You may be better off choosing s = 230 instead of INT_MAX
. Then you could shift by 30 bits.
When calculating ab/s, we often want to round. Adding ½s to the product before dividing is one method of rounding, but it is likely not what you want for negative products. You may want to consider adding −½s if the product is negative.
Upvotes: 5