vlpsk
vlpsk

Reputation: 35

Arbitrary Precision Arithmetics for Double

So I try to use Arbitrary Precision Arithmetics to represent double via exponent and mantissa and afterwards get the same double.

I try to get mantissa and exponent using union:

typedef struct  s_arith
{
    long            mant:64;
    short           exp:15;
    char            sign:1;

}               t_arith;

union u_dbl
{
    t_arith         arith;
    long double     ldbl;
};

Afterwards I try to get initial double using the following formula: (-1)^S * M * 2 ^ E. But it doesn't work for me. What am I doing wrong?

Upvotes: 0

Views: 92

Answers (1)

Ruslan
Ruslan

Reputation: 19100

There are multiple problems with your question:

  1. So I try to use Arbitrary Precision Arithmetics

long double is not arbitrary precision, it's simply extended precision. Arbitrary precision is a mechanism to get arbitrarily many significant digits (up to memory available), while here you only have 64.

  1. long mant:64;

On a 32-bit system this won't compile, because long will likely be 32-bit. You should use unsigned long long or uint64_t (note the unsignedness: signed bit fields can lead to surprises). And actually, this doesn't have to be a bit field if you use uint64_t: the type itself is exactly 64 bit wide.

  1. In the x87 80-bit format, as in all IEEE 754 binary formats, exponent is biased. In particular, in the 80-bit extended precision x87 format the biased exponent to represent 1.0 is 0x3fff. So you can't simply put a 0 in this field and expect the number to be r×2⁰.

With this in mind we can write an example with your definitions amended:

#include <stdint.h>
#include <math.h>

typedef struct  s_arith
{
    uint64_t        mant;
    uint16_t        exp:15;
    uint8_t         sign:1;

}               t_arith;

union u_dbl
{
    t_arith         arith;
    long double     ldbl;
};

#include <stdio.h>

int main()
{
    const uint64_t mant=0xc000000000000000;
    const int expo=5;
    const union u_dbl d={mant,0x3fff+expo,0};
    printf("%Lg\n", d.ldbl);
}

This outputs 48 on my system (32-bit x86 Linux with gcc), as expected — because highest two bits of the significand are set, and others cleared, and the value is 1.5×2⁵.

But actually, to portably load a floating-point number from an integer significand and an exponent, you should use the already existing function from <math.h> header: ldexp. Here's how you could do this:

#include <stdint.h>
#include <stdio.h>
#include <math.h>

int main()
{
    const uint64_t mant=0xc000000000000000;
    const int expo=5;
    const long double x=ldexp(mant, expo-63);
    printf("%Lg\n", x);
}

The -63 shift here compensates for the fact that the first argument of ldexp is taken as the value to be multiplied by 2^(second_argument), instead of simply being put into the significand field of the resulting number representation.

Note also that even on x86 architecture, not all compilers have 80-bit long double: e.g. Microsoft ones have 64-bit long double — same as simply double.

Upvotes: 1

Related Questions