Reputation: 11
I am coding a little calculator in C for my exam preparation. I understand that double is more precise than float, since it has 11 bits reserved for the exponent and 53 bits for the significand. When it comes to integers, I can do the following to catch overflows and underflows:
int sum(int a, int b, int *res) {
    if ((b > 0) && (a > INT_MAX - b)) {        /* a + b would exceed INT_MAX */
        return OVERFLOW_ERROR;
    } else if ((b < 0) && (a < INT_MIN - b)) { /* a + b would fall below INT_MIN */
        return UNDERFLOW_ERROR;
    } else {
        *res = a + b;
    }
    return EXIT_SUCCESS;
}
When it comes to double, if the number gets too large, printing it gives you "inf" or "-inf", which in any case isn't too bad.
AFAIK, floating-point numbers overflow when they lose precision.
So, my question is: how do you handle the loss of precision? Can you make them "precise"? When do they lose precision?
Upvotes: 0
Views: 1318
Reputation: 12698
I can recommend using libgmp
(or some similar library) if you want more precision in your calculations. Apart from cryptography or computing ever more decimals of pi, I can't imagine what environment you'd use it in, but such libraries allow you to extend the capabilities beyond the natural precision of the computer.
There's an example in free42, an emulator of the HP-42S pocket calculator (also adopted by SwissMicros in their range of pocket calculators ---see here, for info): it uses 128-bit floating-point numbers, giving a precision of 34 decimal digits.
But the gain in precision has a penalty (well, not for a simple calculator): the operations have to be solved in software. There are no longer machine instructions to multiply two such floating-point numbers, so each basic operation must be implemented in software, and this slows down the overall calculation.
Upvotes: 0
Reputation: 499
It's been a while since I looked at this properly, but it sounds like you're mixing up your terms - overflow (a numerical value becoming too large) is different to loss of precision (chopping off part of the significand).
IIRC, loss of precision happens whenever a result must be rounded to fit the format - most visibly when converting to a shorter floating-point format or when floating-point numbers become sub-normal/denormalized - so if you really want the greatest precision possible, use long double
(or see if your compiler supports a wider floating-point format) and check for sub-normal numbers at each stage of a calculation. You can't make any floating-point number/calculation "absolutely precise" unless you know you're only dealing with numbers that can be represented exactly (e.g. 0.5, 0.25, 0.125, etc.) and don't do crazy things like add two numbers of wildly different magnitudes together.
Generally, dealing with these sorts of numerical errors is pretty involved, and specific to the calculation being done - e.g. you might re-arrange an equation so that you avoid subtracting two numbers that are very close to each other in value, so you don't lose significance.
If you've not come across it, What Every Computer Scientist Should Know About Floating-Point Arithmetic is a fantastic free article, and I can highly recommend Numerical Computing with IEEE Floating Point Arithmetic for a good read.
Upvotes: 0