Reputation: 13
I'm trying to implement signed-by-unsigned multiplication in C using fixed-point arithmetic, but I get an incorrect result. I can't figure out how to solve this problem. I think the problem is in the bit extension. Here is the piece of code:
int16_t audio_sample=0x1FF; //format signed Q1.8 -> Value represented=-0.00390625
uint8_t gain=0xA; //format unsigned Q5.2 -> Value represented = 2.5
int16_t result = (int16_t)((int32_t)audio_sample * (int32_t)gain);
printf("%x",result);
The result from printf is 0x13F6, which is of course the result of 0x1FF*0xA, but fixed-point arithmetic says that the correct result would be 0x3FF6, considering the proper bit extension. 0x3FF6 in Q6.10 format represents -0.009765625 = -0.00390625*2.5.
Please help me find my mistake.
Thanks in advance.
Upvotes: 1
Views: 370
Reputation: 6087
It is best to think of fixed-point as a matter of scaling, and to express your calculation simply and clearly in terms of numbers — rather than bits. (Example)
A Q1.8 or Q5.2 number in AMD Q notation is a real number scaled by a factor of 2^8 or 2^2, respectively.
But C doesn't have 9-bit or 7-bit number types. Your int16_t and uint8_t variables have enough range to store such numbers. But for arithmetic operations, it is unwise to use unsigned integers, or to mix signed and unsigned types. int has enough range and avoids some efficiency pitfalls.
int audio_sample = -0.00390625*256; // Q1.8
int gain = 2.5*4; // Q5.2
The product of numbers scaled by 2^8 and 2^2 has a scale of 2^10.
int result = audio_sample * gain; // Q6.10
To convert back to the real value, divide by the scale factor.
printf("%lg * %lg = %lg\n",
(double)audio_sample/256,
(double)gain/4,
(double)result/1024);
Please help me find my mistake.
The mistake was in assigning 0x1FF to audio_sample, instead of -1. 0x1FF is the unsigned truncation of the 9-bit two's-complement value -1. But audio_sample is wider and would require more leading 1 bits. It would have been clearer and safer to express your intent by assigning -0.00390625*256 to audio_sample.
the fixed-point arithmetics said that the correct results would be 0x3FF6, considering the proper bit-extension
0x3FF6 is the unsigned 14-bit truncation of the correct two's-complement answer. But the result requires 16 bits, so you're probably looking for the value 0xFFF6.
printf("unsigned Q6.10: 0x%x\n", (unsigned)result & 0xFFFF);
Upvotes: 0
Reputation: 9804
You should use unsigned types here. The representation is in your head (or the comments), not in the data types in the code.
2's complement means the 1 on the left is theoretically continued forever, e.g. 0x1FF in Q1.8 is the same as 0xFFFF in Q8.8 (-1 / 256).
If you have a 16-bit integer, you cannot have Q1.8; it will always be Q8.8, because the machine will not ignore the other bits. So, 0x1FF in Q1.8 should be 0xFFFF in Q8.8. The 0xA in Q5.2 does not change in Q6.2.
0xFFFF * 0xA = 0x9FFF6. Cut away the overflow (therefore use unsigned) and you have 0xFFF6 in Q6.10, which is -10 / 1024, which is your expected result.
Upvotes: 4