Dexter

Reputation: 1337

Splitting a 64-bit value to fit in an argument of type double

I have a function whose signature I cannot change; say it is a library function that I am calling:

void schedule(double _val);

void caller() {
   uint64_t value = 0xFFFFFFFFFFFFFFF;
   schedule(value);
}

Since schedule accepts a double as its argument, I lose precision whenever the value needs more than 52 bits (considering that a double stores its mantissa as a 52-bit value).

What I intend to do is this: if the value is greater than the largest value a double can hold exactly, I loop over the remaining value so that the calls sum to the correct value in the end.

    void caller() {
       uint64_t value = 0xFFFFFFFFFFFFFFF;
       for(count = 0; count < X ; count++) {
           schedule(Y);
       }
    }

I need to extract X and Y from the variable 'value'. How can this be achieved? My objective is not to lose precision because of the type conversion.

Upvotes: 2

Views: 170

Answers (2)

Eric Postpischil

Reputation: 222312

If your problem is only losing precision in caller and not in schedule, then no loop is needed:

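void caller() {
    uint64_t value = 0xFFFFFFFFFFFFFFF;
    uint64_t modulus = (uint64_t) 1 << 53;
    /* High part: a multiple of 2^53, so at most 11 significant bits. */
    schedule(value - value % modulus);
    /* Low part: less than 2^53, so it fits in the 53-bit significand. */
    schedule(value % modulus);
}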

In value - value % modulus, only the high 11 bits are significant, because the low 53 have been cleared. So, when it is converted to double, there is no error, and the exact value is passed to schedule. Similarly, value % modulus has only 53 bits and is converted to double exactly.

(The encoding of the significand of an IEEE-754 64-bit binary floating-point object has 52 bits, but the actual significand has 53 bits, due to the implicit leading bit.)

Note: The above may result in schedule being called with an argument of zero, which we have not established is permitted. If that is a problem, such a call should be skipped.
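
To see the split at work, here is a minimal sketch, assuming an IEEE-754 64-bit double. The schedule below is only a hypothetical stand-in for the real library function (which the question does not show); it adds up what it receives so the total can be checked. The names schedule_exact, scheduled_total, and schedule_calls are made up for illustration, and zero parts are skipped as suggested in the note above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the library's schedule(): it only records
   what it is given. The real function is not shown in the question. */
uint64_t scheduled_total = 0;
int schedule_calls = 0;

void schedule(double val) {
    /* Exact here because every argument is an integer below 2^64. */
    scheduled_total += (uint64_t) val;
    schedule_calls++;
}

/* Split value into a high part (a multiple of 2^53) and a low part
   (less than 2^53). Each part converts to double without rounding.
   Parts equal to zero are skipped. */
void schedule_exact(uint64_t value) {
    const uint64_t modulus = (uint64_t) 1 << 53;
    uint64_t low = value % modulus;
    uint64_t high = value - low;

    if (high != 0) {
        schedule((double) high);
    }
    if (low != 0) {
        schedule((double) low);
    }
}
```

Calling schedule_exact(value) from caller in place of schedule(value) results in at most two calls, and the arguments sum to the original 64-bit value.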

Upvotes: 2

AnT stands with Russia

Reputation: 320381

If N is the max integral value your double can represent precisely, and amount is the 64-bit value you want to schedule, then obviously you can use

Y = N

and

X = amount / Y

(assuming integral division). Once you have finished iterating X times, you still have to schedule the remainder

R = amount % Y

Just keep in mind that all integral calculations have to be performed within the domain of uint64_t type, i.e. you have to add proper suffix to the constants (UL or ULL), or use type casts to uint64_t or use intermediate variables of type uint64_t.
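
Put together, the loop could look like the sketch below. It is only an illustration: schedule is a hypothetical stand-in that adds up its arguments (the real library function is not shown in the question), and schedule_in_chunks, scheduled_total, and schedule_calls are names invented here. The chunk parameter plays the role of Y and is assumed to be nonzero and no larger than 2^53, so that it converts to double exactly.

```c
#include <stdint.h>

/* Hypothetical stand-in for the library's schedule(): it only records
   what it is given. */
uint64_t scheduled_total = 0;
uint64_t schedule_calls = 0;

void schedule(double val) {
    /* Exact here because every argument is an integer below 2^64. */
    scheduled_total += (uint64_t) val;
    schedule_calls++;
}

/* Schedule amount as X calls of size chunk (Y) plus one call for the
   remainder R. All arithmetic stays in uint64_t.
   Assumption: 0 < chunk <= 2^53. */
void schedule_in_chunks(uint64_t amount, uint64_t chunk) {
    uint64_t count = amount / chunk;      /* X */
    uint64_t remainder = amount % chunk;  /* R */

    for (uint64_t i = 0; i < count; i++) {
        schedule((double) chunk);
    }
    if (remainder != 0) {
        schedule((double) remainder);
    }
}
```

With chunk set to (uint64_t) 1 << 53 the loop runs at most 2047 times for any 64-bit amount. A small chunk such as 10000 is still exact, but the number of calls grows in proportion to amount / chunk.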

Of course, if your program doesn't really care how many times schedule is called as long as the total is correct, then you can use virtually any value for N, as long as it can be represented precisely. For example, you can simply set N = 10000.

On the other hand, if you want to minimize the number of schedule calls, then it is worth noting that, due to the "implicit 1" rule, the 52-bit stored mantissa gives 53 bits of precision, so every integer up to (1ULL << 53) - 1 is represented precisely (and 1ULL << 53 itself is also exact, being a power of two).
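
The 53-bit limit is easy to check with a round trip through double. The helper below is a sketch that assumes an IEEE-754 64-bit double; converts_exactly is a name invented for this example.

```c
#include <stdint.h>

/* Returns 1 if value survives a round trip through double unchanged,
   0 otherwise. Assumption: double is IEEE-754 binary64. */
int converts_exactly(uint64_t value) {
    double d = (double) value;

    /* Values near UINT64_MAX round up to 2^64, which does not fit in
       uint64_t, so converting back would be undefined. */
    if (d >= 18446744073709551616.0) {
        return 0;
    }
    return (uint64_t) d == value;
}
```

Under that assumption the helper returns 1 for (1ULL << 53) - 1 and for 1ULL << 53, and 0 for (1ULL << 53) + 1 and for the 0xFFFFFFFFFFFFFFF used in the question.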

Upvotes: 0
