Reputation: 35
So I try to use Arbitrary Precision Arithmetics to represent double via exponent and mantissa and afterwards get the same double.
I try to get mantissa and exponent using union:
typedef struct s_arith
{
long mant:64;
short exp:15;
char sign:1;
} t_arith;
union u_dbl
{
t_arith arith;
long double ldbl;
};
Afterwards I try to get initial double using the following formula: (-1)^S * M * 2 ^ E. But it doesn't work for me. What am I doing wrong?
Upvotes: 0
Views: 92
Reputation: 19100
There are multiple problems with your question:
So I try to use Arbitrary Precision Arithmetics
long double
is not arbitrary precision, it's simply extended precision. Arbitrary precision is a mechanism to get arbitrarily many significant digits (up to memory available), while here you only have 64.
long mant:64;
On a 32-bit system this won't compile, because long
will likely be 32-bit. You should use unsigned long long
or uint64_t
(note the unsignedness: signed bit fields can lead to surprises). And actually, this doesn't have to be a bit field if you use uint64_t
: the type itself is exactly 64 bit wide.
1.0
is 0x3fff
. So you can't simply put a 0
in this field and expect the number to be r×2⁰.With this in mind we can write an example with your definitions amended:
#include <stdint.h>
#include <math.h>
typedef struct s_arith
{
uint64_t mant;
uint16_t exp:15;
uint8_t sign:1;
} t_arith;
union u_dbl
{
t_arith arith;
long double ldbl;
};
#include <stdio.h>
int main()
{
const uint64_t mant=0xc000000000000000;
const int expo=5;
const union u_dbl d={mant,0x3fff+expo,0};
printf("%Lg\n", d.ldbl);
}
This outputs 48
on my system (32-bit x86 Linux with gcc), as expected — because highest two bits of the significand are set, and others cleared, and the value is 1.5×2⁵.
But actually, to portably load a floating-point number from an integer significand and an exponent, you should use the already existing function from <math.h>
header: ldexp
. Here's how you could do this:
#include <stdint.h>
#include <stdio.h>
#include <math.h>
int main()
{
const uint64_t mant=0xc000000000000000;
const int expo=5;
const long double x=ldexp(mant, expo-63);
printf("%Lg\n", x);
}
The -63
shift here compensates for the fact that the first argument of ldexp
is taken as the value to be multiplied by 2^(second_argument)
, instead of simply being put into the significand field of the resulting number representation.
Note also that even on x86 architecture, not all compilers have 80-bit long double
: e.g. Microsoft ones have 64-bit long double
— same as simply double
.
Upvotes: 1