Reputation: 2469

How to shift a floating-point value to the nearest one that can be represented exactly in a specific number of decimal places?

Is there an algorithm in C++ that, given a floating-point value V of type T (e.g. double or float), returns the closest value to V in a given direction (up or down) that can be represented exactly in at most a specified number of decimal places D?

For example, given

T = double 
V = 670000.08267799998 
D = 6

For direction = towards +inf I would like the result to be 670000.082678, and for direction = towards -inf I would like the result to be 670000.082677.

This is somewhat similar to std::nexttoward(), but with the restriction that the 'next' value needs to be exactly representable using at most D decimal places.

I've considered a naive solution involving separating out the fractional portion and scaling it by 10^D, truncating it, and scaling it again by 10^-D and tacking it back onto the whole number portion, but I don't believe that guarantees that the resulting value will be exactly representable in the underlying type.
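For reference, the naive approach described above might look like the following minimal sketch. The name naiveRound is hypothetical, and the code assumes V is finite and that 10^D is exactly representable as a double (true for small D). It illustrates the concern rather than solving it: the scaling, the division and the final addition can each round, so the result is in general only the double nearest to the desired decimal value.

```cpp
#include <cmath>

// Naive sketch (hypothetical name): scale the fractional part by 10^D,
// round it in the requested direction, scale back, and re-attach the
// whole-number part.
double naiveRound(double v, unsigned d, bool roundUp) {
    double whole = 0.0;
    const double frac = std::modf(v, &whole);   // exact split into whole + fraction
    const double scale = std::pow(10.0, static_cast<double>(d));
    double scaled = frac * scale;               // may already carry rounding error
    scaled = roundUp ? std::ceil(scaled) : std::floor(scaled);
    return whole + scaled / scale;              // division and addition may round again
}
```

For the example in the question, naiveRound(670000.08267799998, 6, false) lands within a rounding error of 670000.082677, but nothing guarantees that the returned double equals that decimal value exactly.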

I'm hopeful that there's a way to do this properly, but so far I've been unable to find one.

Edit: I think my original explanation didn't properly convey my requirements. At the suggestion of @patricia-shanahan I'll try to describe my higher-level goal and then reformulate the problem a little differently in that context.

At the highest level, the reason I need this routine is due to some business logic wherein I must take in a double value K and a percentage P, and split K into two double components V1 and V2 where V1 ~= P percent of K and V1 + V2 ~= K. The catch is that V1 is used in further calculations before being sent to a 3rd party over a wire protocol that accepts floating-point values in string format with a max of D decimal places. Because the value sent to the 3rd party (in string format) needs to be reconcilable with the results of the calculations made using V1 (in double format), I need to "adjust" V1 using some function F() so that it is as close as possible to being P percent of K while still being exactly representable in string format using at most D decimal places. V2 has none of the restrictions of V1, and can be calculated as V2 = K - F(V1) (it is understood and acceptable that this may result in V2 such that V1 + V2 is very close to but not exactly equal to K).

At the lower level, I'm looking to write that routine to 'adjust' V1 as something with the following signature:

double F(double V, unsigned int D, bool roundUpIfTrueElseDown);

where the output is computed by taking V and (if necessary, and in the direction specified by the bool param) rounding it to the Dth decimal place.

My expectation would be that when V is serialized out as follows

const auto maxD = std::numeric_limits<double>::digits10;
assert(D <= maxD); // D will be less than maxD... e.g. typically 1-6, definitely <= 13
std::cout << std::fixed 
          << std::setprecision(maxD) 
          << F(V, D, true);

then the output contains only zeros beyond the Dth decimal place.

It's important to note that, for performance reasons, I am looking for an implementation of F() that does not involve conversion back and forth between double and string format. Though the output may eventually be converted to a string format, in many cases the logic will early-out before this is necessary and I would like to avoid the overhead in that case.

Upvotes: 2

Answers (4)

chux

Reputation: 154262

Total re-write.

Based on OP's new requirement and using power-of-2 as suggested by @Patricia Shanahan, simple C solution:

double roundedV = ldexp(round(ldexp(V, D)),-D);  // for nearest
double roundedV = ldexp(ceil (ldexp(V, D)),-D);  // at or just greater
double roundedV = ldexp(floor(ldexp(V, D)),-D);  // at or just less

The only thing added here beyond @Patricia Shanahan's fine solution is C code to match OP's tag.
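As a rough illustration, the ceil and floor one-liners could be wrapped in the signature from the question along the following lines. This is only a sketch: it assumes IEEE 754 binary64 doubles, a finite V, and that V * 2^D still fits in 53 significant bits. The bound of 15 on D is taken from the assertion in the question, not from anything the technique requires.

```cpp
#include <cassert>
#include <cmath>

// Sketch of F() using power-of-two scaling. The result is a multiple of
// 2^-D, so it prints exactly with at most D decimal places.
double F(double V, unsigned int D, bool roundUpIfTrueElseDown) {
    assert(D <= 15);                           // assumption carried over from the question
    const int d = static_cast<int>(D);
    const double scaled = std::ldexp(V, d);    // V * 2^D, an exact exponent shift
    const double rounded =
        roundUpIfTrueElseDown ? std::ceil(scaled) : std::floor(scaled);
    return std::ldexp(rounded, -d);            // back to the original scale
}
```

With V = 670000.08267799998 and D = 6 this yields 670000.09375 when rounding up and 670000.078125 when rounding down. Both are multiples of 1/64, so they are noticeably farther from V than the nearest multiples of 10^-6.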

Upvotes: 1

Patricia Shanahan

Reputation: 26185

This is a sketch of a program that does what is requested. It is presented mainly to find out whether that is really what is wanted. I wrote it in Java, because that language has some guarantees about floating point arithmetic on which I wanted to depend. I only use BigDecimal to get exact display of doubles, to show that the answers are exactly representable with no more than D digits after the decimal point.

Specifically, I depended on double behaving according to IEEE 754 64-bit binary arithmetic. That is likely, but not guaranteed by the standard, for C++. I also depended on Math.pow being exact for simple exact cases, on exactness of division by a power of two, and on being able to get exact output using BigDecimal.

I have not handled edge cases. The big missing piece is dealing with large magnitude numbers with large D. I am assuming that the bracketing binary fractions are exactly representable as doubles. If they have more than 53 significant bits that will not be the case. It also needs code to deal with infinities and NaNs. The assumption of exactness of division by a power of two is incorrect for subnormal numbers. If you need your code to handle them, you will have to put in corrections.

It is based on the concept that a number that is both exactly representable as a decimal with no more than D digits after the decimal point and is exactly representable as a binary fraction must be representable as a fraction with denominator 2 raised to the D power. If it needs a higher power of 2 in the denominator, it will need more than D digits after the decimal point in its decimal form. If it cannot be represented at all as a fraction with a power-of-two denominator, it cannot be represented exactly as a double.
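The criterion in the previous paragraph can be turned into a small check, sketched here in C++ (the language of the question) rather than Java. The helper name fitsInDecimalPlaces is hypothetical, and the sketch assumes IEEE 754 binary64 doubles and ignores subnormals: a finite double needs at most d digits after the decimal point exactly when multiplying it by 2^d gives an integer.

```cpp
#include <cmath>

// Hypothetical helper: true if the exact value of v can be written with
// at most d digits after the decimal point.
bool fitsInDecimalPlaces(double v, int d) {
    if (!std::isfinite(v)) return false;
    const double scaled = std::ldexp(v, d);     // v * 2^d, exact for normal results
    // An integer here means v = n / 2^d, and n / 2^d = (n * 5^d) / 10^d.
    return std::isfinite(scaled) && scaled == std::trunc(scaled);
}
```

For instance, 0.25 passes for d = 2 but not for d = 1, and 0.1 (which is not a binary fraction with a small denominator once stored as a double) fails for d = 6.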

Although I ran some other cases for illustration, the key output is:

670000.082678 to 6 digits Up: 670000.09375 Down: 670000.078125

Here is the program:

import java.math.BigDecimal;

public class Test {
  public static void main(String args[]) {
    testIt(2, 0.000001);
    testIt(10, 0.000001);
    testIt(6, 670000.08267799998);
  }

  private static void testIt(int d, double in) {
    System.out.print(in + " to " + d + " digits");
    System.out.print(" Up: " + new BigDecimal(roundUpExact(d, in)).toString());
    System.out.println(" Down: "
        + new BigDecimal(roundDownExact(d, in)).toString());
  }

  public static double roundUpExact(int d, double in) {
    double factor = Math.pow(2, d);
    double roundee = factor * in;
    roundee = Math.ceil(roundee);
    return roundee / factor;
  }

  public static double roundDownExact(int d, double in) {
    double factor = Math.pow(2, d);
    double roundee = factor * in;
    roundee = Math.floor(roundee);
    return roundee / factor;
  }
}

Upvotes: 2

rici

Reputation: 241931

In general, decimal fractions are not precisely representable as binary fractions. There are some exceptions, like 0.5 (½) and 16.375 (16⅜), which happen to be binary fractions as well. The converse does hold: all binary fractions are precisely representable as decimal fractions. (That's because 2 is a factor of 10, but 10 is not a factor of 2, or any power of two.) But if a number is not an integer multiple of some power of 2 (such as ⅛), its binary representation will be an infinitely long repeating sequence, like the representation of ⅓ in decimal (.333...).

The standard C library provides the macro DBL_DIG (normally 15); any decimal number with that many decimal digits of precision can be converted to a double (for example, with scanf) and then converted back to a decimal representation (for example, with printf). To go in the opposite direction without losing information -- start with a double, convert it to decimal and then convert it back -- you need 17 decimal digits (DBL_DECIMAL_DIG). (The values I quote are based on IEEE-754 64-bit doubles).
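A small C++ sketch of that round trip, assuming IEEE-754 64-bit doubles and a standard library whose conversions are correctly rounded. The helper roundTrip and the constant names are hypothetical; std::numeric_limits<double>::digits10 and max_digits10 are the C++ counterparts of DBL_DIG and DBL_DECIMAL_DIG.

```cpp
#include <cstdio>
#include <cstdlib>
#include <limits>

constexpr int kDblDig = std::numeric_limits<double>::digits10;            // 15 for binary64
constexpr int kDblDecimalDig = std::numeric_limits<double>::max_digits10; // 17 for binary64

// Hypothetical helper: print v with the given number of significant
// digits, then parse the text back into a double.
double roundTrip(double v, int significantDigits) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", significantDigits, v);
    return std::strtod(buf, nullptr);
}
```

Under those assumptions roundTrip(v, kDblDecimalDig) returns v for any finite v, whereas roundTrip(v, kDblDig) can return a neighbouring double: 0.30000000000000004 comes back as 0.3.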

One way to provide something close to the question would be to consider a decimal number with no more than DBL_DIG digits of precision to be an "exact-but-not-really-exact" representation of a floating point number if that floating point number is the floating point number which comes closest to the value of the decimal number. One way to find that floating point number would be to use scanf or strtod to convert the decimal number to a floating point number, and then try the floating point numbers in the vicinity (using nextafter to explore) to find which ones convert to the same representation with DBL_DIG digits of precision.

If you trust the standard library implementation to not be too far off, you could convert your double to a decimal number using sprintf, increment the decimal string at the desired digit position (which is just a string operation), and then convert it back to a double with strtod.
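The string-based idea might be sketched as follows for the round-up direction. Both function names are hypothetical, and the sketch assumes v is finite and non-negative and that snprintf and strtod are correctly rounded. Note that the result is the double nearest to the D-place decimal, not an exact representation of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: add one unit in the last printed digit of a
// non-negative fixed-point decimal string, propagating carries.
std::string incrementLastDigit(std::string s) {
    for (std::size_t i = s.size(); i-- > 0;) {
        if (s[i] == '.') continue;               // skip the decimal point
        if (s[i] != '9') { ++s[i]; return s; }   // no carry needed
        s[i] = '0';                              // 9 -> 0, carry to the left
    }
    return "1" + s;                              // e.g. "9.99" -> "10.00"
}

// Sketch: the d-place decimal at or just above v, as the nearest double.
double roundUpViaString(double v, int d) {
    assert(v >= 0.0 && d >= 0);
    char buf[512];
    std::snprintf(buf, sizeof buf, "%.*f", d, v);  // rounds to nearest at d places
    const double candidate = std::strtod(buf, nullptr);
    if (candidate >= v) return candidate;          // nearest decimal is already not below v
    return std::strtod(incrementLastDigit(buf).c_str(), nullptr);
}
```

For example, roundUpViaString(0.123, 2) gives the double nearest 0.13. This does involve the string conversions the question hoped to avoid, so it is mainly useful as a reference to test a faster routine against.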

Upvotes: 1

Cheers and hth. - Alf

Reputation: 145429

In C++ integers must be represented in binary, but floating point types can have a decimal representation.

If FLT_RADIX from <float.h> (<cfloat> in C++) is 10, or some multiple of 10, then your goal of exact representation of decimal values is attainable.

Otherwise, in general, it's not attainable.

So, as a first step, try to find a C++ implementation where FLT_RADIX is 10.
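Checking this on a given implementation takes only a few lines. The sketch below uses a hypothetical helper name, doubleIsDecimal, and relies on std::numeric_limits<double>::radix, which reports the same radix as FLT_RADIX.

```cpp
#include <cfloat>
#include <limits>

// Hypothetical helper: true if double uses a radix that is a multiple
// of 10, so that short decimal fractions are stored exactly.
constexpr bool doubleIsDecimal() {
    return std::numeric_limits<double>::radix % 10 == 0;
}

// Both the macro and the numeric_limits member describe the same radix.
static_assert(std::numeric_limits<double>::radix == FLT_RADIX,
              "numeric_limits radix should agree with FLT_RADIX");
```

On the common IEEE 754 binary implementations the radix is 2, so doubleIsDecimal() is false there.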

I wouldn't worry about algorithm or efficiency thereof until the C++ implementation is installed and proved to be working on your system. But as a hint, your goal seems to be suspiciously similar to the operation known as “rounding”. I think, after obtaining my decimal floating point C++ implementation, I’d start by investigating techniques for rounding, e.g., googling that, maybe Wikipedia, …

Upvotes: 0
