johni

Reputation: 5568

custom floats addition, implementing math expression - C

I'm implementing a new kind of float, "NewFloat", in C. It uses 32 bits and has no sign bit (only positive numbers), so all 32 bits are used by the exponent and the mantissa.

In my example, I have 6 bits for the exponent (EXPBITS) and 26 for the mantissa (MANBITS). There is also an offset, used for representing negative exponents, which is 2^(EXPBITS-1) - 1.

Given a NewFloat nf1, the translation to a real number is done like this: nf1 = 2^(exponent - offset) * (1 + mantissa/2^MANBITS).
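As a sanity check on this formula, a minimal sketch of the decoding step might look like the following. The bit layout (exponent in the top EXPBITS bits, mantissa below it) and the names nf_pack and nf_to_double are assumptions made for illustration; they are not given by the exercise.

```c
#include <assert.h>
#include <stdint.h>

#define EXPBITS 6
#define MANBITS 26
#define OFFSET  ((1 << (EXPBITS - 1)) - 1)   /* 2^(EXPBITS-1) - 1 = 31 */

typedef uint32_t NewFloat;

/* Assumed layout: [ exponent : EXPBITS | mantissa : MANBITS ]. */
static NewFloat nf_pack(uint32_t exp, uint32_t man) {
    assert(exp < (1u << EXPBITS) && man < (1u << MANBITS));
    return (exp << MANBITS) | man;
}

/* value = 2^(exp - OFFSET) * (1 + man / 2^MANBITS) */
static double nf_to_double(NewFloat nf) {
    uint32_t exp = nf >> MANBITS;
    uint32_t man = nf & ((1u << MANBITS) - 1u);
    double significand = 1.0 + (double)man / (double)(1u << MANBITS);
    /* Scale by 2^(exp - OFFSET); ldexp() from <math.h> would do the same. */
    return significand * (double)(1ULL << exp) / (double)(1ULL << OFFSET);
}
```

With this layout, exponent 31 and mantissa 0 decode to 1.0, and exponent 32 with mantissa 2^25 decode to 3.0.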

Now, given two NewFloats (nf1, nf2), each with its own exponent and mantissa (exp1, man1, exp2, man2) and the same offset, and assuming that nf1 > nf2, I can calculate the exponent and mantissa of the sum of nf1 and nf2. This is done like this: link

To spare your time, I found that:
Exponent of the sum is: exp1
Mantissa of the sum is: man1 + 2^(exp2 - exp1 + MANBITS) + 2^(exp2 - exp1) * man2

To simplify the code, I split the mantissa into components and compute each one separately:
x = 2^(exp2 - exp1 + MANBITS)
y = 2^(exp2 - exp1) * man2

I'm fairly sure that I'm not implementing the mantissa part correctly:

unsigned long long x = (1 << (exp2 - exp1 + MANBITS));
unsigned long long y = ((1 << exp2) >> exp1) * man2;
unsigned long long tempMan = man1;
tempMan += x + y;

unsigned int exp = exp1;                                    // CAN USE DIRECTLY EXP1.
unsigned int man = (unsigned int)tempMan;

The sum is represented like this: sum = 2^(exp1 - offset) * (1 + (man1 + x + y)/2^MANBITS).

The last thing I have to handle is an overflow of the sum's mantissa. In this case, I should add 1 to the exponent and divide the whole (1 + (man1 + x + y)/2^MANBITS) expression by 2.

In that case, given that I only need to represent the numerator in bits, how do I do that after the division?

Is there any problem in my implementation? I have a feeling there is.

If you have a better way of doing this, I would be really happy to hear about it.

Please, don't ask me why I do this.. it's an exercise which I've been trying to solve for more than 10 hours.

Upvotes: 1

Views: 142

Answers (1)

chux

Reputation: 153517

The code is doing signed int shifts, and unsigned long long shifts are certainly what is desired.

// unsigned long long x = (1    << (exp2 - exp1 + MANBITS));
   unsigned long long x = (1LLU << (exp2 - exp1 + MANBITS));
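Since nf1 > nf2 implies exp2 - exp1 <= 0, both powers of two in the question are really right shifts, and `(1 << exp2) >> exp1` evaluates to 0 whenever the exponents differ. A possible sketch of the two aligned terms, assuming exp1 >= exp2 and truncation of the shifted-out bits, follows. The helper names aligned_hidden_bit and aligned_mantissa are hypothetical.

```c
#include <assert.h>

#define MANBITS 26

/* x = 2^(exp2 - exp1 + MANBITS): nf2's implicit leading 1, aligned to exp1. */
static unsigned long long aligned_hidden_bit(unsigned exp1, unsigned exp2) {
    assert(exp1 >= exp2);
    unsigned shift = exp1 - exp2;
    /* Guard keeps the shift count in range; the bit is lost past MANBITS. */
    return shift > MANBITS ? 0ULL : (1ULL << MANBITS) >> shift;
}

/* y = 2^(exp2 - exp1) * man2: nf2's mantissa, aligned to exp1 (truncated). */
static unsigned long long aligned_mantissa(unsigned exp1, unsigned exp2,
                                           unsigned long long man2) {
    assert(exp1 >= exp2);
    unsigned shift = exp1 - exp2;
    return shift >= MANBITS ? 0ULL : man2 >> shift;
}
```

Shifting right by the non-negative difference avoids a negative shift count, which is undefined behavior in C.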

Notes:

Suggest more meaningful variable names like x_mantissa.

Rounding not implemented. Rounding can cause a need for increase in exponent.

Overflow not detected/implemented.

Sub-normals not implemented. Should NewFloat not use them, note that a-b --> 0 does not mean a == b.
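To illustrate the mantissa overflow case, here is a minimal sketch of the whole addition on full significands (implicit 1 included). It assumes exp1 >= exp2, that every exponent value is an ordinary one (no sub-normals or special values), and it truncates rather than rounds. The name nf_add and the bool return convention are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXPBITS  6
#define MANBITS  26
#define MAN_MASK ((1u << MANBITS) - 1u)
#define EXP_MAX  ((1u << EXPBITS) - 1u)

/* Adds (exp1, man1) + (exp2, man2), requiring exp1 >= exp2.
   Truncates instead of rounding; returns false on exponent overflow. */
static bool nf_add(uint32_t exp1, uint32_t man1, uint32_t exp2, uint32_t man2,
                   uint32_t *exp_out, uint32_t *man_out) {
    assert(exp1 >= exp2 && exp1 <= EXP_MAX);
    assert(man1 <= MAN_MASK && man2 <= MAN_MASK);

    uint32_t shift = exp1 - exp2;
    uint64_t sig1 = (1ULL << MANBITS) | man1;   /* 1.man1 as an integer */
    uint64_t sig2 = (1ULL << MANBITS) | man2;   /* 1.man2 as an integer */
    sig2 = shift > MANBITS ? 0 : sig2 >> shift; /* align to exp1 */

    uint64_t sum = sig1 + sig2;                 /* always below 2^(MANBITS+2) */
    uint32_t exp = exp1;
    if (sum >> (MANBITS + 1)) {                 /* significand reached 2.0 */
        sum >>= 1;                              /* halve it ... */
        exp += 1;                               /* ... and bump the exponent */
    }
    if (exp > EXP_MAX) {
        return false;                           /* result not representable */
    }
    *exp_out = exp;
    *man_out = (uint32_t)(sum & MAN_MASK);      /* drop the implicit 1 */
    return true;
}
```

Adding 1.0 + 1.0 (exponent 31, mantissa 0 for both) takes the carry branch and yields exponent 32 with mantissa 0. Keeping the implicit 1 in the integer makes the overflow check a single bit test, and masking afterwards answers the question of what to store after the division.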

Upvotes: 3
