Reputation: 1189
I am writing a program in C/C++ in Visual Studio 2012 Professional (Windows) that calculates many powers using pow(). I ran the profiler to find out why it takes such a long time to run, and I found that pow() is the bottleneck.
I have rewritten powers such as pow(x,1.5) to x*sqrt(x), and pow(x,1.75) to sqrt(x*x*x*sqrt(x)), which significantly improved the speed of the program.
A few powers are of the kind pow(x,1.0/3.0), so I looked for the cube root function cbrt() to speed things up, but it seems to be unavailable in Visual Studio, which I can hardly imagine. Therefore my question: where can I find the cbrt() function in Visual Studio 2012 Professional, and if it is not available, what are the alternatives other than pow(x,1.0/3.0)?
Kind regards,
Ernst Jan
Upvotes: 4
Views: 2436
Reputation: 2749
MSVC 2012 must be about the last version that didn't have cbrt implemented in math.h. The MS system cbrt benchmarks terribly but isn't so bad in real code. It is not the most accurate, though.
The best combination of accuracy and speed in public cbrt code that I have found on most compilers is Sun's implementation, by Bruce Evans, of Kahan's magic-constant algorithm. Intel's system cbrt in their 2023 compiler is unbelievably accurate through careful use of truncation and FMA in the final refinement.
An easy alternative if there is nothing better to hand is:
#include <math.h>

double cbrt(double x)
{   // fast and accurate; a single Newton step isn't sufficient to tidy it up
    // could be improved by computing (y*y)*y - x with FMA for extra precision
    double t, y;
    if (x == 0) return x;
    y = exp(log(fabs(x)) / 3);          // |x|^(1/3); fabs, not abs, which is for int
    if (x < 0) y = -y;
    t = y * y * y;
    return y - y * (t - x) / (2 * t + x);  // Halley refinement
}
Depending on the CPU architecture, either of these is worth a try.
Upvotes: 0
Reputation: 11916
The implementation below is 4x faster than std::pow, at a relatively loose tolerance (0.000001), on an AVX-512 CPU. It is built from vertically auto-vectorizable loops for every basic operation, such as multiplication and division, so that it computes 8/16/32 elements at once instead of horizontally vectorizing the Newton-Raphson loop.
#include <cmath>

/*
   Newton-Raphson iterative solution
   f_err(x) = x*x*x - N
   f'_err(x) = 3*x*x
   x = x - (x*x*x - N)/(3*x*x)
   x = x - (x - N/(x*x))/3   <--- repeat until error < tolerance
   but with vertical parallelization
*/
template <typename Type, int Simd, int inverseTolerance>
inline void cubeRootFast(Type *const __restrict__ data,
                         Type *const __restrict__ result) noexcept
{
    // alignment 64 required for AVX512 vectorization
    alignas(64) Type xd[Simd];
    alignas(64) Type resultData[Simd];
    alignas(64) Type xSqr[Simd];
    alignas(64) Type nDivXsqr[Simd];
    alignas(64) Type diff[Simd];

    // cube root checking mask
    for (int i = 0; i < Simd; i++)
    {
        xd[i] = data[i] <= Type(0.000001);
    }
    // skips division by zero if input is zero or close to zero
    for (int i = 0; i < Simd; i++)
    {
        resultData[i] = xd[i] ? Type(1.0) : data[i];
    }
    // Newton-Raphson iterations in parallel
    bool work = true;
    while (work)
    {
        // compute x*x
        for (int i = 0; i < Simd; i++)
        {
            xSqr[i] = resultData[i] * resultData[i];
        }
        // compute N/(x*x)
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = data[i] / xSqr[i];
        }
        // compute x - N/(x*x)
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = resultData[i] - nDivXsqr[i];
        }
        // compute (x - N/(x*x))/3
        for (int i = 0; i < Simd; i++)
        {
            nDivXsqr[i] = nDivXsqr[i] / Type(3.0);
        }
        // compute x - (x - N/(x*x))/3
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = resultData[i] - nDivXsqr[i];
        }
        // compute error
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = resultData[i] - diff[i];
        }
        // compute absolute error
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = std::abs(diff[i]);
        }
        // compute condition to stop looping (error < tolerance)?
        for (int i = 0; i < Simd; i++)
        {
            diff[i] = diff[i] > Type(1.0 / inverseTolerance);
        }
        // all SIMD lanes have to have zero work left to end
        Type check = 0;
        for (int i = 0; i < Simd; i++)
        {
            check += diff[i];
        }
        work = (check > Type(0.0));
        // compute the next x guess
        for (int i = 0; i < Simd; i++)
        {
            resultData[i] = resultData[i] - nDivXsqr[i];
        }
    }
    // if input was close to zero, output zero; output result otherwise
    for (int i = 0; i < Simd; i++)
    {
        result[i] = xd[i] ? Type(0.0) : resultData[i];
    }
}

#include <iostream>

int main()
{
    constexpr int n = 8192;
    constexpr int simd = 16;
    constexpr int inverseTolerance = 1000;
    float data[n];
    for (int i = 0; i < n; i++)
    {
        data[i] = i;
    }
    for (int i = 0; i < n; i += simd)
    {
        cubeRootFast<float, simd, inverseTolerance>(data + i, data + i);
    }
    for (int i = 0; i < 10; i++)
        std::cout << data[i * i * i] << std::endl;
    return 0;
}
It is tested only with GCC, so it may require extra MSVC pragmas on each loop to force auto-vectorization. If you have OpenMP, then you can also use #pragma omp simd safelen(Simd) to achieve the same thing.
The performance only holds within the [0,1] range. To use bigger values, you should apply range reduction like this:
// example: max value is 1000
for (int i = 0; i < n; i++)
    inputs[i] = inputs[i] / 1000.0f;  // normalize into [0,1]
for (int i = 0; i < n; i += simd)
    cubeRootFast<float, simd, inverseTolerance>(inputs + i, inputs + i);
for (int i = 0; i < n; i++)
    inputs[i] = 10.0f * inputs[i];    // de-normalize (1000 = 10 x 10 x 10)
If you only need ~0.005 error on a low range like [0,1000], with a 16x speedup, you can try the implementation below. It uses a polynomial approximation (Horner's scheme is applied so it compiles to FMA instructions), and no explicit auto-vectorization is required since it contains no branching or loops:
// optimized for [0,1] range: ~1 cycle per element on AVX512, 0.003 average error
// polynomial approximation with Horner scheme for FMA optimization
template <typename T>
T cubeRootFast(T x)
{
    T xd = x - T(1.0);
    T result = T(-55913.0 / 4782969.0);
    result *= xd;
    result += T(21505.0 / 1594323.0);
    result *= xd;
    result += T(-935.0 / 59049.0);
    result *= xd;
    result += T(374.0 / 19683.0);
    result *= xd;
    result += T(-154.0 / 6561.0);
    result *= xd;
    result += T(22.0 / 729.0);
    result *= xd;
    result += T(-10.0 / 243.0);
    result *= xd;
    result += T(5.0 / 81.0);
    result *= xd;
    result += T(-1.0 / 9.0);
    result *= xd;
    result += T(1.0 / 3.0);
    result *= xd;
    result += T(1.0);
    return result;
}
// range reduction + de-reduction: ~1 cycle per element on AVX512
for (int i = 0; i < 8192; i++)
{
    float inp = input[i];
    // scaling + descaling for range [1,999]
    float scaling = (inp > 333.0f) ? 1000.0f : 333.0f;
    scaling = (inp > 103.0f) ? scaling : 103.0f;
    scaling = (inp > 29.0f) ? scaling : 29.0f;
    scaling = (inp > 7.0f) ? scaling : 7.0f;
    scaling = (inp > 3.0f) ? scaling : 3.0f;
    output[i] = powf(scaling, 0.33333333333f) * cubeRootFast<float>(inp / scaling);
}
Upvotes: 0
Reputation: 17258
This site explores several computational methods to compute cube root efficiently in C, and has some source code you can download.
(EDIT: A google search for "fast cube root" comes up with several more promising-looking hits.)
Cube roots are a topic of interest, because they're used in many common formulae and a fast cube root function isn't included with Microsoft Visual Studio.
In the absence of a dedicated cube root function, a typical strategy is calculation via a power function (e.g., pow(x, 1.0/3.0)). This can be problematic in terms of speed, and in terms of correctness: pow returns NaN for a negative base with a non-integer exponent, so negative inputs must be handled separately.
His site has some benchmarks on the methods used; all of them are much faster than pow():
32-bit float tests
----------------------------------------
cbrt_5f       8.8 ms    5 mbp    6.223 abp
pow         144.5 ms   23 mbp   23.000 abp
halley x 1   31.8 ms   15 mbp   18.961 abp
halley x 2   59.0 ms   23 mbp   23.000 abp
newton x 1   23.4 ms   10 mbp   12.525 abp
newton x 2   48.9 ms   20 mbp   22.764 abp
newton x 3   72.0 ms   23 mbp   23.000 abp
newton x 4   89.6 ms   23 mbp   23.000 abp
See the site for downloadable source.
Upvotes: 5