bodokaiser

Reputation: 15752

How to find (optimal) integer ratio of different precision?

If I have a variable m of type uint32 and r of type uint16, as well as a float64 constant, e.g. f=0.5820766091346741, how do I find m, r which satisfy f=r/m?

Similar to Fraction.limit_denominator from Python.

This GitHub repo contains various best-rational-approximation algorithms, but they only limit the denominator.
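
To illustrate the problem (Python): `limit_denominator` only caps the denominator, so nothing keeps the numerator within the smaller type.

```python
from fractions import Fraction

f = 0.5820766091346741

# limit_denominator bounds only the denominator (here to the uint32 range);
# the numerator is unconstrained and in this case exceeds the uint16 range.
approx = Fraction(f).limit_denominator(2**32 - 1)
print(approx)  # 9765625/16777216 -- numerator 9765625 > 65535
```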

Upvotes: 1

Views: 921

Answers (4)

chux

Reputation: 154075

How do I find m,r which satisfy f=r/m?

= implies exact.

To do this exactly, if possible, see below. This approach does not attempt a best fit when an exact solution is impossible, as that would not satisfy f=r/m.


All finite floating point values are exact. "0.5820766091346741" saved in f may give f a nearby value, yet the value in f is exact.

Given the base of the floating-point number (very commonly 2), they can all be represented exactly as "integer/(base^exponent)".

With binary64, the largest exponent needed is about (1023 + 53).

As OP wants the result to fit in a 32-bit r and a 16-bit m, it is readily understandable that most float64 values (64 bits) will not have an exact solution - just not enough bit combinations to hold the result.
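
As an aside, Python's fractions module recovers the exact dyadic value stored in a double, which makes the "finite floats are exact" point concrete:

```python
from fractions import Fraction

# The double nearest to the question's literal is an exact dyadic rational:
f = Fraction(0.5820766091346741)
print(f)  # 9765625/16777216, i.e. an integer over 2**24
```

The denominator 2**24 already exceeds 16 bits, so no exact 16-bit power-of-2 denominator exists for this value.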

Algorithm below in commented C assuming base 2.

#include <math.h>
#include <stdbool.h>
#include <stdint.h>

// return true on success
bool fraction(double d, uint32_t *r, uint16_t *m) {
  if (d < 0.0 || isnan(d) || d > UINT32_MAX) {
    return false;
  }

  // Scale d to extract, hopefully, a 32+15 bit integer
  uint16_t power_of_2 = 32768; // largest power-of-2 denominator m
  d *= power_of_2;
  uint64_t ipart = (uint64_t) d;
  // Even after scaling, `d` may have a fractional part.
  if (d != ipart) {
    return false;  // value has unrepresentable precision.
  }

  // While too big and even, reduce the fraction
  while (ipart > UINT32_MAX && (ipart % 2 == 0)) {
    power_of_2 /= 2;
    ipart /= 2;
  }

  // If reduction was insufficient ...
  if (ipart > UINT32_MAX) {
    return false; // value has unrepresentable precision.
  }

  *r = (uint32_t) ipart;
  *m = power_of_2;
  return true;  // Success!
}

Upvotes: 2

aka.nice

Reputation: 9512

I won't give you an algorithm because, IMO, continued fractions are the right path.

But I wanted to illustrate how well this representation fits 64-bit IEEE 754 floating point. So I've played a bit with the concept in Smalltalk (Squeak, 64 bits).

There are only 48 bits for the r/m representation, with many combinations representing the same value (1/1=2/2=..., 1/2=2/4=3/6=...), while there are already 2^53 different 64-bit floats in the interval [0.5,1.0) alone. So we can say that, most of the time, we are not going to match f exactly. The problem is then to find a pair (r,m) whose ratio rounds nearest to f.

I can't reasonably play with 48 bits, but I can with half the precision, gathering all the uint8/uint16 combinations:

v := Array new: 1<<24.
0 to: 1<<8-1 do: [:r |
    0 to: 1<<16-1 do: [:m |
        v at: (m<<8+r+1) put: ([r asFloat/m asFloat]
            on: ZeroDivide do: [:exc | exc return: Float infinity])]].
s := v asSet sorted.
s size-2.

Excluding 0 and inf, that's 10,173,377 distinct values out of 16,777,216 combinations.
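
For readers without Smalltalk at hand, here is a scaled-down sketch of the same enumeration in Python (4-bit numerator and 8-bit denominator instead of 8/16, purely to keep the run small; m=0 is skipped rather than mapped to infinity):

```python
# Count the distinct values r/m can take with a 4-bit r and an 8-bit m,
# a scaled-down stand-in for the uint8/uint16 experiment above.
distinct = {r / m for m in range(1, 1 << 8) for r in range(1 << 4)}
print(len(distinct), "distinct values out of", (1 << 4) * (1 << 8), "combinations")
```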

I'm interested in the gap between two consecutive representable floats:

x := s copyFrom: 2 to: s size - 1.
y := (2 to: s size-1) collect: [:i |  (s at: i) - (s at: i-1) / (s at: i) ulp].

the minimum is

u := y detectMin: #yourself.

about 2.71618435e8 ulp.

Let's see how the numerator and denominator are formed:

p := y indexOf: u.
{((v  indexOf: (x at: p)) - 1) hex.
 ((v  indexOf: (x at: p-1)) - 1) hex}.

results in #('16rFDFFFE' '16rFEFFFF'); the first four hex digits encode the denominator (m), the last two the numerator (r).

So the minimum gap is obtained for

s1 := (1<<8-1) / (1<<8-1<<8-1).
s2 := (1<<8-2) / (1<<8-2<<8-1).
s2 asFloat - s1 asFloat / s2 asFloat ulp = u.

It is around the value 1/256.

We can conjecture that the minimum gap for the 48-bit representation is

s1 := (1<<16-1) / (1<<16-1<<16-1).
s2 := (1<<16-2) / (1<<16-2<<16-1).
s2 asFloat - s1 asFloat / s2 asFloat ulp.

That is around 16 ulp, not that bad, and the maximum density is around 1/65536 (or somewhere near).

What will the density be near 0.5, as in your example? For the 24-bit representation:

h := x indexOf: 0.5.

is 10133738. Let's inspect the precision in the neighbourhood:

k := (h to: h +512) detectMin: [:i | (y at: i)].
u2 := y at: k.

That's 3.4903102168e10 ulp (about 128 times less dense). It is obtained for:

s1 := (1<<8-1) / (1<<8-1<<1-1).
s2 := (1<<8-2) / (1<<8-2<<1-1).
s2 asFloat- s1 asFloat / s2 asFloat ulp = u2.

So, with 48 bits, we can expect a density of about

s1 := (1<<16-1) / (1<<16-1<<1-1).
s2 := (1<<16-2) / (1<<16-2<<1-1).
s2 asFloat- s1 asFloat / s2 asFloat ulp.

that is 524320 ulp, or a precision of approximately 5.821121362714621e-11.

Edit: What about the worst precision?

In the zone of best density:

q := (p-512 to:p+512) detectMax: [:i | y at: i].
{((v  indexOf: (x at: q)) - 1) hex.
 ((v  indexOf: (x at: q-1)) - 1) hex.}.

That is #('16rFEFFFF' '16r10001'); in other words, just before the best precision we locally have the worst: w := y at: q. That is 6.8990021713e10 ulp, for these numbers:

s2 := (1<<8-1) / (1<<8-1<<8-1).
s1 := (1) / (1<<8).
s2 asFloat - s1 asFloat / s2 asFloat ulp = w.

Translated to 48 bits, that is about 1.048592e6 ulp:

s2 := (1<<16-1) / (1<<16-1<<16-1).
s1 := (1) / (1<<16).
s2 asFloat - s1 asFloat / s2 asFloat ulp.

And near 0.5, the worst is about 8.847936399549e12 ulp for 24 bits:

j := (h-512 to: h +512) detectMax: [:i | (y at: i)].
w2 := y at: j.
s2 := (1<<8-1) / (1<<8-1<<1-1).
s1 := (1) / (1<<1).
s2 asFloat- s1 asFloat / s2 asFloat ulp = w2.

or translated to 48 bits, 3.4360524818e10 ulp:

s2 := (1<<16-1) / (1<<16-1<<1-1).
s1 := (1) / (1<<1).
s2 asFloat- s1 asFloat / s2 asFloat ulp.

That's about 3.814784579114772e-6 of absolute precision, not that good.

Before adopting such a representation, it would be good to know the domain of f, along with the average and worst-case precision achievable in that domain.

Upvotes: 0

bodokaiser

Reputation: 15752

There is a paper by David T. Ashley et al. which proposes an algorithm to find a rational approximation by two integers of different precision.

I implemented a basic version which does not contain the whole complexity of the referred paper.

The basic idea is to convert the float into a continued fraction and then look for the highest-order convergent which satisfies the constraints. See Wikipedia for an introduction to convergents.

However, the referred paper describes a more sophisticated approach to applying constraints on the integer ratios (see section 5), which uses an analogy to lattice structures.
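
A minimal sketch of that basic version in Python (convergents only, no semiconvergent refinement; the helper name and the uint16/uint32 bounds are mine):

```python
from fractions import Fraction

def constrained_convergent(f, max_num, max_den):
    """Highest-order continued-fraction convergent p/q of f
    with p <= max_num and q <= max_den."""
    x = Fraction(f)                  # exact value stored in the float
    p0, q0 = 0, 1                    # convergent h(-2)/k(-2)
    p1, q1 = 1, 0                    # convergent h(-1)/k(-1)
    while True:
        a = x.numerator // x.denominator   # next partial quotient
        p2, q2 = a * p1 + p0, a * q1 + q0  # next convergent
        if p2 > max_num or q2 > max_den:
            break                          # constraint violated: keep previous
        p0, q0, p1, q1 = p1, q1, p2, q2
        frac = x - a
        if frac == 0:                      # expansion terminated: exact match
            break
        x = 1 / frac
    return p1, q1

# uint16 numerator r and uint32 denominator m, as in the question
r, m = constrained_convergent(0.5820766091346741, 2**16 - 1, 2**32 - 1)
```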

Upvotes: 1

FDavidov

Reputation: 3675

The straightforward answer would be:

     ROUND(f * 10^8)
f = ----------------
         10^8

Then, you can implement a small loop that attempts to divide both numerator and denominator by prime numbers (starting from 2 and going up). Something like (code not checked, of course):

var m = Math.round(f * 1e8) ;  // note: 10^8 in JavaScript is bitwise XOR, not a power
var r = 1e8                 ;
var Prime_Numbers = [2,3,5,7,11,13,17,19, /* ... */] ;

for (var I = 0 ; I < Prime_Numbers.length ; I++) {

    if ((Prime_Numbers[I] > m) ||
        (Prime_Numbers[I] > r)    ) {

       break;
    }

    // divide repeatedly: a prime factor can appear more than once
    while (((m % Prime_Numbers[I]) == 0) &&
           ((r % Prime_Numbers[I]) == 0)    ) {
          m = m / Prime_Numbers[I] ;
          r = r / Prime_Numbers[I] ;
    }
}

console.log("Best m is: " + m) ;
console.log("Best r is: " + r) ;

Now, the question would be: how many prime numbers should be included in the list?

Hard to say, but intuitively not too many... I would say it depends on how rigorous you are about OPTIMAL.

Hope this gives you some direction.

Cheers!!

EDIT:

Thinking a little bit further: to always get the ABSOLUTE OPTIMAL values, you need to include all prime numbers up to half the max value you wish as precision. For instance, if your precision needs to be 8 digits (99999999), you need to include all prime numbers up to 99999999/2.

EDIT 2:

Added an exit condition in the loop.

Upvotes: 1
