How to convert to/from an 8-bit float representation?

Question

I'm currently creating an emulator for a hypothetical CPU. The CPU has 16 8-bit registers which can either represent a signed byte or an 8-bit float.

Both SByte and FByte contain a byte member variable.

I currently have worked out how to get the real value of the floating byte using the following:

FByte = SEEEEMMM

value = (-1)^S + 1.M^(E-7)

S = Sign bit
M = Mantissa
E = Exponent
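
As a rough sketch, that decoding might look like this in Java. It assumes the formula is meant as (-1)^S * 1.M * 2^(E-7), which is not quite what is written above. The class and method names (FByteDecoder, decode) are hypothetical, and no special encodings (zero, infinity, NaN) are assumed.

```java
// Hypothetical helper, not part of the emulator described above.
final class FByteDecoder {
    // Assumed bias, taken from the (E-7) term in the formula.
    private static final int FBYTE_BIAS = 7;

    static double decode(byte raw) {
        int bits = raw & 0xFF;              // treat the byte as unsigned
        int sign = (bits >>> 7) & 0x1;      // S: bit 7
        int exponent = (bits >>> 3) & 0xF;  // EEEE: bits 6-3
        int mantissa = bits & 0x7;          // MMM: bits 2-0

        // 1.M with the implied leading 1; each mantissa step is 1/8.
        double significand = 1.0 + mantissa / 8.0;

        // Scale by 2^(E-7). Assumption: every bit pattern is a normal value.
        double magnitude = Math.scalb(significand, exponent - FBYTE_BIAS);
        return sign == 0 ? magnitude : -magnitude;
    }
}
```

For example, the bit pattern 1 1000 110 (0xC6) has S = 1, E = 8 and M = 6, so decode((byte) 0xC6) returns -(1 + 6/8) * 2^1 = -3.5.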

How would I go about converting a given double value (e.g. -3.562) into a float representation (as SEEEEMMM)?

Thanks in advance!

EDIT: I know how to do this in theory (write the value in base-2 scientific notation and read off the binary representation), but doing it that way in my program would require String manipulation, and I'd rather keep String intermediaries out of it.

ajb · Accepted Answer

The basic plan for converting a double to your float representation should be:

Convert the double to a long using Double.doubleToLongBits. This gives the IEEE 754 representation of the double.
Extract the parts of the double by using bit operations on the doubleToLongBits result. Bit 63 is the sign bit. Bits 62-52 are the biased exponent. Bits 51-0 are the mantissa.
The upper 3 bits of the mantissa (bits 51-49 of the original double) will become your resulting 3-bit mantissa. (There is an implied 1 in both formats.) However, you'll have to decide how to handle rounding if bit 48 of the original double is 1. If bits 51-49 are 0b111 and you decide to round up, write your code very carefully, because the mantissa then goes from [1].111 to [1]0.000, which means you will need to shift it one place to the right (to get [1].000), and that affects the resulting exponent. (I'm using [1] to indicate an implied 1 bit in the mantissa.)
To get the new exponent, take the original biased exponent, subtract 1023, and add 7. 1023 is the bias of an IEEE 754 double, and 7 appears to be the bias of your floating-point type. The result will be the new exponent, but it could be out of range for your 4-bit exponent field (0 to 15), so you'll have to decide what to do in that case. [Also, you might have to add another 1 to the new exponent if you round up, as noted above.]
The sign bit of the result is the sign bit of the original double. (I'm assuming that you meant the formula to be (-1)^S * 1.M * 2^(E-7), with * instead of + and with 2 as the base that is raised to E-7.)
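
The steps above could be sketched in Java roughly as follows. This is only one possible reading: the names FByteEncoder and encode are hypothetical, rounding looks at bit 48 only (round half up in magnitude), and the out-of-range handling is an assumption, since the question does not define it. Here values that are too large (including infinity) saturate to the largest magnitude, values that are too small (including zero and subnormal doubles) saturate to the smallest magnitude, and NaN is rejected.

```java
// Hypothetical helper illustrating the plan above; not a definitive implementation.
final class FByteEncoder {
    private static final int DOUBLE_BIAS = 1023; // exponent bias of an IEEE 754 double
    private static final int FBYTE_BIAS = 7;     // bias implied by the (E-7) term

    static byte encode(double value) {
        if (Double.isNaN(value)) {
            // Assumption: the 8-bit format has no NaN encoding.
            throw new IllegalArgumentException("NaN cannot be encoded");
        }

        long bits = Double.doubleToLongBits(value);
        int sign = (int) (bits >>> 63);                 // bit 63
        int biasedExp = (int) ((bits >>> 52) & 0x7FF);  // bits 62-52
        int mantissa = (int) ((bits >>> 49) & 0x7);     // bits 51-49
        boolean roundUp = ((bits >>> 48) & 1L) != 0;    // bit 48

        // Re-bias the exponent: remove the double's bias, add the FByte bias.
        int exponent = biasedExp - DOUBLE_BIAS + FBYTE_BIAS;

        if (roundUp) {
            mantissa++;
            if (mantissa == 8) {
                // [1].111 rounded up to [1]0.000: renormalise to [1].000
                // and bump the exponent by one.
                mantissa = 0;
                exponent++;
            }
        }

        // Assumed saturation for values outside the 4-bit exponent range.
        if (exponent > 15) {
            exponent = 15;
            mantissa = 7;
        } else if (exponent < 0) {
            exponent = 0;
            mantissa = 0;
        }

        return (byte) ((sign << 7) | (exponent << 3) | mantissa);
    }
}
```

With this sketch, encode(-3.562) yields 0xC6 (1 1000 110): 3.562 is 1.781 * 2^1, the top three mantissa bits are 110, bit 48 is 0, and the exponent field is 1 + 7 = 8. That bit pattern stands for -3.5. No strings are involved at any point, only shifts and masks on the long.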

See https://en.wikipedia.org/wiki/IEEE_floating_point for more information about the format of a double.
