Michal

Reputation: 691

How to force pow(float, int) to return float

The overloaded function float pow(float base, int iexp) was removed in C++11, and pow now returns a double. In my program I compute lots of these powers (in single precision), and I am interested in the most efficient way to do it.

Is there some special function (in standard libraries or any other) with the above signature?

If not, is it better (in terms of single-precision performance) to explicitly cast the result of pow to float before any other operations (which would promote everything else to double), or to cast iexp to float and use the overload float pow(float base, float exp)?

EDIT: Why do I need float and not double?

The primary reason is RAM -- I need tens or hundreds of GB, so this reduction is a huge advantage. So from float input I need float output. And now I need the most efficient way to achieve that (fewer casts, already-optimized algorithms, etc.).

Upvotes: 12

Views: 2960

Answers (5)

Boris L.

Reputation: 91

If you're targeting GCC you can try

float __builtin_powif(float, int)

I have no idea about its performance, though.

Upvotes: 2

ivaigult

Reputation: 6667

You could easily write your own fpow using exponentiation by squaring.

float my_fpow(float base, unsigned exp)
{
    float result = 1.f;
    while (exp)
    {
        if (exp & 1)        // this bit is set: multiply in the current power
            result *= base;
        exp >>= 1;
        base *= base;       // square: base, base^2, base^4, base^8, ...
    }

    return result;
}


Boring part:

This algorithm gives the best accuracy that can be achieved with the float type when |base| > 1.

Proof:

Say we want to calculate pow(a, n), where a is the base and n is the exponent.
Define b_1 = a^1, b_2 = a^2, b_3 = a^4, b_4 = a^8, and so on.

Then a^n is the product of all b_i for which the i-th bit is set in n.

So we have an ordered set B = {b_k1, b_k2, ..., b_km}, where for every j the bit k_j is set in n.

The following obvious algorithm A can be used to minimize rounding error:

  • If B contains a single element, it is the result
  • Pick the two elements p and q from B with the smallest absolute values
  • Remove them from B
  • Calculate the product s = p*q and put it back into B
  • Go to the first step

Now, let's prove that the elements of B can simply be multiplied from left to right without losing accuracy. It comes from the fact that

b_j > b_1 * b_2 * ... * b_{j-1}

because b_j = b_{j-1} * b_{j-1} = b_{j-1} * b_{j-2} * b_{j-2} = ... = b_{j-1} * b_{j-2} * ... * b_1 * b_1.

Since b_1 = a^1 = a, and its absolute value is greater than one, we indeed get

b_j > b_1 * b_2 * ... * b_{j-1}

Hence we may conclude that during multiplication from left to right the accumulator variable stays smaller than any element of B not yet multiplied in.

Then the expression result *= base; (except on the very first iteration, for sure) multiplies the two smallest numbers currently in B, so the rounding error is minimal. So the code implements algorithm A.

Upvotes: 2

AMA

Reputation: 4214

Is there some special function (in standard libraries or any other) with the above signature?

Unfortunately, not that I know of.


But, as many have already mentioned, benchmarking is necessary to understand whether there is even an issue at all.

I've assembled a quick benchmark online. Benchmark code:

#include <iostream>
#include <vector>
#include <boost/timer/timer.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_real_distribution.hpp>
#include <cmath>

int main ()
{
    boost::random::mt19937 gen;
    boost::random::uniform_real_distribution<> dist(0, 10000000);

    const size_t size = 10000000;
    std::vector<float> bases(size);
    std::vector<float> fexp(size);
    std::vector<int> iexp(size);
    std::vector<float> res(size);

    for(size_t i=0; i<size; i++)
    {
        bases[i] = dist(gen);
        iexp[i] = std::floor(dist(gen));
        fexp[i] = iexp[i];
    }

    std::cout << "float pow(float, int):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for(size_t i=0; i<size; i++)
            res[i] = std::pow(bases[i], iexp[i]);
    }

    std::cout << "float pow(float, float):" << std::endl;
    {
        boost::timer::auto_cpu_timer timer;
        for(size_t i=0; i<size; i++)
            res[i] = std::pow(bases[i], fexp[i]);
    }
    return 0;
}

Benchmark results (quick conclusions):

  • gcc: c++11 is consistently faster than c++03.
  • clang: the int version under c++03 indeed seems a little faster. I'm not sure whether that is within the margin of error, since I only ran the benchmark online.
  • Both: even with c++11, calling pow with an int exponent seems to be a tad more performant.

It would be great if others could verify if this holds for their configurations as well.

Upvotes: 1

Arne Vogel

Reputation: 6666

Another question that can only be honestly answered with "wrong question". Or at least: "Are you really willing to go there?". float theoretically needs ca. 80% less die space (for the same number of cycles) and so can be much cheaper for bulk processing. GPUs love float for this reason.

However, let's look at x86 (admittedly, you didn't say what architecture you're on, so I picked the most common). The price in die space has already been paid. You literally gain nothing by using float for calculations. Actually, you may even lose throughput because additional extensions from float to double are required, and additional rounding to intermediate float precision. In other words, you pay extra to have a less accurate result. This is typically something to avoid except maybe when you need maximum compatibility with some other program.

See Jens' comment as well. These options give the compiler permission to disregard some language rules to achieve higher performance. Needless to say this can sometimes backfire.

There are two scenarios where float might be more efficient, on x86:

  • GPU (including GPGPU), in fact many GPUs don't even support double and if they do, it's usually much slower. Yet, you will only notice when doing very many calculations of this sort.
  • CPU SIMD aka vectorization

You'd know if you did GPGPU. Explicit vectorization using compiler intrinsics is also a choice – one you could make, for sure, but it requires quite a cost-benefit analysis. Possibly your compiler is able to auto-vectorize some loops, but this is usually limited to "obvious" cases, such as multiplying each number in a vector<float> by another float, and this case is not so obvious IMO. Even if you pow each number in such a vector by the same int, the compiler may not be smart enough to vectorize it effectively, especially if pow resides in another translation unit and there is no effective link-time code generation.
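To illustrate the "obvious" case mentioned above, this is the kind of loop most compilers can auto-vectorize at -O2/-O3 (a hypothetical example, not from the answer itself):

```cpp
#include <cstddef>
#include <vector>

// The textbook auto-vectorizable loop: independent iterations,
// unit stride, the same scalar factor for every element.
void scale(std::vector<float>& v, float factor)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= factor;
}
```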

If you are not ready to consider changing the whole structure of your program to allow effective use of SIMD (including GPGPU), and you're not on an architecture where float is indeed much cheaper by default, I suggest you stick with double by all means, and consider float at best a storage format that may be useful to conserve RAM, or to improve cache locality (when you have a lot of them). Even then, measuring is an excellent idea.

That said, you could try ivaigult's algorithm (only with double for the intermediate and for the result), which is related to a classical algorithm called Egyptian multiplication (and a variety of other names), only that the operands are multiplied and not added. I don't know how pow(double, double) works exactly, but it is conceivable that this algorithm could be faster in some cases. Again, you should be OCD about benchmarking.
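A sketch of that suggestion: ivaigult's squaring loop, but accumulating in double and narrowing only at the end (the name my_fpow_d is mine, not from either answer):

```cpp
// Exponentiation by squaring with a double accumulator:
// intermediates are rounded in double precision, and only
// the final value is narrowed back to float.
float my_fpow_d(float base, unsigned exp)
{
    double b = base;
    double result = 1.0;
    while (exp) {
        if (exp & 1)
            result *= b;
        exp >>= 1;
        b *= b;
    }
    return static_cast<float>(result);
}
```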

Upvotes: 2

ivan.ukr

Reputation: 3551

Try using powf() instead. This is a C99 function that is also available in C++11.

Upvotes: 0
