Reputation: 15387

Rules for Explicit int32 -> float32 Casting

I have a homework assignment to emulate floating point casts, e.g.:

int y = /* ... */;
float x = (float)(y);

. . . but obviously without using casting. That's fine, and I wouldn't have a problem, except I can't find any specific, concrete definition of how exactly such casts are supposed to operate.

I have written an implementation that works fairly well, but it doesn't quite match up occasionally (for example, it might put a value of three in the exponent and fill the mantissa with ones, but the "ground truth" will have a value of four in the exponent and fill the mantissa with zeroes). The fact that the two are equivalent (sorta, by infinite series) is frustrating because the bit pattern is still "wrong".
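To see exactly which bit pattern the "ground truth" produces, it can help to dump the compiler's own cast. Below is a minimal sketch, assuming `float` is a 32-bit IEEE 754 single on the platform (the C standard does not guarantee that); `reference_bits` and `matches_reference` are hypothetical helper names, not part of the assignment.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of the compiler's own int -> float cast.
   Assumption: float is a 32-bit IEEE 754 single on this platform. */
uint32_t reference_bits(int y) {
    float f = (float)y;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);  /* copy the bytes; avoids pointer-cast aliasing problems */
    return bits;
}

/* Returns 1 if a candidate bit pattern agrees with the real cast, 0 otherwise. */
int matches_reference(int y, uint32_t candidate) {
    return reference_bits(y) == candidate;
}
```

For instance, on a typical round-to-nearest platform, 33554431 (25 one bits) comes out as 0x4C000000: the exponent has been bumped by one and the mantissa is all zeroes, which is the kind of mismatch described above.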

Sure, I get vague things, like "round toward zero" from scattered websites, but honestly my searches keep getting clogged with C newbie questions (e.g., "What's a cast?", "When do I use it?"). So, I can't find a general rule that works for explicitly defining the exponent and the mantissa.

Help?

Upvotes: 2

Answers (3)

geometrian

Reputation: 15387

Thanks everyone for the very useful help! The rules for rounding were especially helpful!

I am pleased to say that, with the help of this question's responses, and all you glorious people, I successfully implemented the function. My final function is:

unsigned float_i2f(int x) {
    /* Apply a complex series of operations to make the cast.  Rounding was achieved with the help of my post http://stackoverflow.com/questions/9288241/rules-for-explicit-int32-float32-casting. */
    int sign, exponent, y;
    int shift, shift_is_pos, shifted_x, deshifted_x, dropped;
    int mantissa;

    if (x==0) return 0;

    sign = x<0 ? 0x80000000 : 0; //extract sign
    x = sign ? -x : x; //absolute value, sorta

    //Check how big the exponent needs to be to offset the necessary shift to the mantissa.
    exponent = 0;
    y = x;
    while (y/=2) {
        ++exponent;
    }

    shift = exponent - 23; shift_is_pos = shift >= 0; //How much to shift x to get the mantissa, and whether that shift is left or right.

    shifted_x = (shift_is_pos ? (x>>shift) : (x<<-shift)); //Shift x
    deshifted_x = (shift_is_pos ? (shifted_x<<shift) : (shifted_x>>-shift)); //Unshift it (fills right with zeros)
    dropped = x - deshifted_x; //Subtract the difference.  This gives the rounding error.

    mantissa = 0x007FFFFF & shifted_x; //Remove leading MSB (it is represented implicitly)

    //It is only possible for bits to have been dropped if the shift was positive (right).
    if (shift_is_pos) {
        //We dropped some bits.  Rounding may be necessary.
        if ((0x01<<(shift-1))&dropped ) {
            //The MSB of the dropped bits is 1.  Rounding may be necessary.

            //Kill the MSB of the dropped bits (taking into account hardware ignoring 32 bit shifts).
            if (shift==1) dropped = 0;
            else dropped <<= 33-shift;

            if (dropped) {
                //The remaining dropped bits have one or more bits set.
                goto INC_MANTISSA;
            }
            //The remaining dropped bits are all 0
            else if (mantissa&0x01) {
                //LSB is 1
                goto INC_MANTISSA;
            }
        }
    }

    //No rounding is necessary
    goto CONTINUE;

    //For incrementing the mantissa.  Handles overflow by incrementing the exponent and setting the mantissa to 0.
INC_MANTISSA:
    ++mantissa;
    if (mantissa&(0x00800000)) {
        mantissa = 0;
        ++exponent;
    }

    //Resuming normal program flow.
CONTINUE:
    exponent += 127; //Bias the exponent

    return sign | (exponent<<23) | mantissa; //Or it all together and return.
}

It solves all test cases correctly, although I'm certain it does not handle everything correctly (for example, if x is 0x80000000, then the "absolute value" section will return 0x80000000, because of overflow).
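One possible way around the 0x80000000 overflow is to take the absolute value in unsigned arithmetic, where wraparound is well defined. This is only a sketch; `magnitude` is a hypothetical helper name and it assumes a 32-bit `int`.

```c
#include <stdint.h>

/* Absolute value as an unsigned quantity.
   Unsigned negation wraps modulo 2^32, so INT32_MIN maps to 0x80000000
   instead of overflowing the way -x does for a signed int. */
uint32_t magnitude(int32_t x) {
    uint32_t ux = (uint32_t)x;
    return x < 0 ? 0u - ux : ux;
}
```

The rest of the function would then work on the unsigned magnitude, which also keeps the right shifts logical rather than arithmetic.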

Once again, I want to thank all of you greatly for your help!

Thanks, Ian

Upvotes: 0

Michael Burr

Reputation: 340218

Since this is homework, I'll just post some notes about what I think is the tricky part: rounding when the integer has more significant bits than the float's precision can hold. It sounds like you already have a solution for the basics of obtaining the exponent and mantissa.

I'll assume that your float representation is IEEE 754, and that rounding is performed the same way that MSVC and MinGW do: using a "banker's rounding" (round half to even) scheme (I'm honestly not sure if that particular rounding scheme is required by the standard; it's what I tested against though). The remaining discussion assumes the int to be converted is greater than 0. Negative numbers can be handled by dealing with their absolute value and setting the sign bit at the end. Of course, 0 needs to be handled specially in any case (because there's no msb to find).

Since there are 24 bits of precision in the mantissa (including the implied 1 for the msb), ints up to 16777215 (or 0x00ffffff) can be represented exactly. There's nothing particularly special to do other than the bit shifting to get things in the right place and calculating the correct exponent depending on the shifts.
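As a rough sketch of that no-rounding case (the function name is hypothetical, and it only covers 0 < x <= 0x00ffffff), the msb position gives the exponent and a left shift lines the remaining bits up in the 23-bit mantissa field:

```c
#include <stdint.h>

/* Float bit pattern for 0 < x <= 0x00FFFFFF. No rounding is needed in
   this range because every bit of x fits in the 24-bit significand. */
uint32_t small_int_to_float_bits(uint32_t x) {
    int msb = 0;                       /* index of the highest set bit */
    for (uint32_t t = x; (t >>= 1) != 0; ) {
        ++msb;
    }
    /* Move the msb up to bit 23, then mask it off (it is implicit). */
    uint32_t mantissa = (x << (23 - msb)) & 0x007FFFFFu;
    uint32_t exponent = (uint32_t)msb + 127u;   /* biased exponent */
    return (exponent << 23) | mantissa;
}
```

For example, 10 has its msb at bit 3, so the exponent field is 130 and the result is 0x41200000, the pattern for 10.0f.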

However, if there are more than 24 bits of precision in the int value, you'll need to round. I performed the rounding using these steps:

1. If the msb of the dropped bits is 0, nothing more needs to be done. The mantissa and exponent can be left alone.
2. If the msb of the dropped bits is 1, and the remaining dropped bits have one or more bits set, the mantissa needs to be incremented. If the mantissa overflows (beyond 24 bits, assuming you haven't already dropped the implied msb), then the mantissa needs to be shifted right, and the exponent incremented.
3. If the msb of the dropped bits is 1, and the remaining dropped bits are all 0, then the mantissa is incremented only if the lsb is 1. Handle overflow of the mantissa similarly to case 2.
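The three cases above can be sketched as a single decision function. This is just one way to express them, not necessarily how I wrote my test code; `should_round_up` is a hypothetical name, `x` is the positive magnitude, and `shift` is how many low bits are being dropped (assumed to be between 1 and 31).

```c
#include <stdint.h>

/* Returns 1 if the kept bits should be incremented, 0 otherwise.
   Assumes 1 <= shift <= 31. */
int should_round_up(uint32_t x, int shift) {
    uint32_t half     = 1u << (shift - 1);         /* msb of the dropped bits */
    uint32_t dropped  = x & ((half << 1) - 1u);    /* the low `shift` bits */
    uint32_t kept_lsb = (x >> shift) & 1u;         /* lsb of what remains */

    if (dropped < half) return 0;   /* msb of dropped bits is 0 */
    if (dropped > half) return 1;   /* msb is 1 and other dropped bits are set */
    return (int)kept_lsb;           /* exact tie: round to even */
}
```

Comparing `dropped` against `half` is equivalent to testing the msb of the dropped bits and then checking whether anything else is set, it just avoids a second mask.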

Since the mantissa increment will overflow only when it's all 1's, if you're not carrying around the mantissa's msb (i.e., if you've already dropped it since it'll be dropped in the ultimate float representation), then the cases where the mantissa increment overflows can be fixed up simply by setting the mantissa to zero and incrementing the exponent.
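A related trick, offered as an alternative sketch rather than what the description above requires: if the biased exponent and the 23-bit mantissa are packed together before the increment, the carry out of an all-1's mantissa lands in the exponent field by itself. `pack_rounded` and its parameters are hypothetical names; `round_up` is assumed to be 0 or 1.

```c
#include <stdint.h>

/* Combine a biased exponent and 23-bit mantissa, then apply the rounding
   increment. If the mantissa is all 1's, adding 1 clears it and carries
   into the exponent field, which is exactly the fix-up described above. */
uint32_t pack_rounded(uint32_t biased_exponent, uint32_t mantissa, uint32_t round_up) {
    uint32_t packed = (biased_exponent << 23) | (mantissa & 0x007FFFFFu);
    return packed + round_up;
}
```

For example, exponent field 151 with mantissa 0x7FFFFF packs to 0x4BFFFFFF, and rounding up gives 0x4C000000: exponent field 152 with a zero mantissa. The sign bit would be OR'ed in afterwards.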

Upvotes: 2

Lefteris

Reputation: 3256

I saw your question and remembered some code for floating point emulation I had written a long time ago. First of all, a very important piece of advice for floating point numbers: read "What Every Programmer Should know about Floating point". It's a very nice and complete guide on the subject.

As for my code, I dug around and found it, but I have to warn you it's ugly, and since it was for a personal project (my undergrad thesis) it's not properly commented. The code might also have certain peculiarities since it targeted an embedded system (a robot). The link to the page that explains the project and has a download link for the code is here. Don't mind the website, I am no web designer I am afraid :)

This is how I represented floating points in that project:

typedef struct
{
    union{
        struct {
           unsigned long mantissa: 23;
           unsigned long exponent: 8;
           unsigned long sign: 1;
       } float_parts;   //the struct shares same memory space as the float
                        //allowing us to access its parts with the bitfields

        float all;

    };

}_float __attribute__((__packed__));

It uses bitfields, whose explanation is, I guess, outside the scope of this topic, so refer to the link if you want to learn more.
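Since the layout of bitfields is implementation-defined, a more portable way to get at the same three parts is shifts and masks on a 32-bit integer. The following is a sketch, not the code from my project: `float_fields`, `split_float` and `join_float` are hypothetical names, and it assumes `float` is a 32-bit IEEE 754 single.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits, implicit leading 1 not stored */
} float_fields;

/* Split a float into its fields using masks instead of bitfields. */
float_fields split_float(float f) {
    uint32_t bits;
    float_fields parts;
    memcpy(&bits, &f, sizeof bits);
    parts.sign     = bits >> 31;
    parts.exponent = (bits >> 23) & 0xFFu;
    parts.mantissa = bits & 0x007FFFFFu;
    return parts;
}

/* Reassemble the fields into a float. */
float join_float(float_fields parts) {
    uint32_t bits = (parts.sign << 31)
                  | ((parts.exponent & 0xFFu) << 23)
                  | (parts.mantissa & 0x007FFFFFu);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the fields are taken from the integer value rather than from individual bytes, this version does not depend on how the compiler orders bitfields within a word.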

What will probably interest you in there is this function. Please note that the code is not very well written and I have not looked at it for years. Also note that since I was targeting only the specific robot's architecture, the code has no checks for endianness. In any case, I hope it's of use to you.

_float intToFloat(int number)
{
    int i;
    //will hold the resulting float
    _float result;

    //depending on the number's sign determine the floating number's sign
    if(number > 0)
        result.float_parts.sign = 0;
    else if(number < 0)
    {
        number *= -1; //since it would have been in twos complements
                     //being negative and all
        result.float_parts.sign = 1;
    }
    else // 0 is kind of a special case
    {
        parseFloat(0.0,&result);
        return result;
    }

    //get the individual bytes (not considering endianness here, since it is for the robot only for now)
    unsigned char* bytes= (unsigned char*)&number;

    //we have to get the most significant bit of the int
    for(i = 31; i >=0; i --)
    {
        if(bytes[i/8] & (0x01 << (i-((i/8)*8))))
            break;
    }


    //and adding the bias, input it into the exponent of the float
    //because the exponent says where the decimal (or binary) point is placed relative to the beginning of the mantissa
    result.float_parts.exponent = i+127;


    //now let's prepare for mantissa calculation
    result.float_parts.mantissa = (bytes[2] <<  16 | bytes[1] << 8 | bytes[0]);

    //actual calculation of the mantissa
    i= 0;
    while(!(result.float_parts.mantissa & (0x01<<22)) && i<23) //the i is to make sure that
    {                                                          //for all zero mantissas we don't
        result.float_parts.mantissa <<=1;                      //get infinite loop
        i++;
    }
    result.float_parts.mantissa <<=1;


    //finally we got the number
    return result;
}
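If endianness is a concern, the search for the most significant bit can be done on the integer value itself instead of on its bytes. A minimal sketch of that idea (with `highest_set_bit` as a hypothetical name, operating on the already-positive magnitude) might look like this:

```c
#include <stdint.h>

/* Index (0..31) of the highest set bit of x, or -1 if x is 0.
   Works on the value, not the bytes, so byte order does not matter. */
int highest_set_bit(uint32_t x) {
    int index = -1;
    while (x != 0) {
        x >>= 1;
        ++index;
    }
    return index;
}
```

The returned index plus the bias of 127 is the value that would go into the exponent field before any rounding adjustment.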

Upvotes: 1
