Reputation: 3451
I am trying to implement a fast natural exponentiation function. I read the corresponding section in the pbrt-v4 book, and wrote this function, which computes e^x
as 2^(1/ln(2) * x)
by taking the integer and fractional parts i
and f
of 1/ln(2) * x
, using a polynomial to approximate 2^f
, then assuming IEEE754 float32
representation and adding i
to the exponent.
The main function fast_exp
is below.
inline auto fast_exp(float exponent) {
auto base_two_exponent = exponent * 1.44269504089f;
auto exp_integer_part = std::floor(base_two_exponent);
auto exp_fractional_part = base_two_exponent - exp_integer_part;
/* Use polynomial to approximate 2 ^ exp_fractional_part */
auto two_to_exp_fractional_part = evaluate_polynomial(
exp_fractional_part,
1.f, 0.695556856f, 0.226173572f, 0.0781455737f
);
int resulting_exponent = IEEE754::get_exponent(two_to_exp_fractional_part)
+ static_cast<int>(exp_integer_part);
/* If exponent is out of range do clamping */
if (resulting_exponent < -126) { return 0._; }
else if (resulting_exponent > 127) { return std::numeric_limits<float>::infinity(); }
auto bits = IEEE754::float_to_bits(two_to_exp_fractional_part);
bits &= 0b10000000011111111111111111111111; /* Zero out exponent bits */
bits |= (resulting_exponent + (1 << 7) - 1) << 23; /* Fill in exponent with resulting exponent */
return IEEE754::bits_to_float(bits);
}
For full clarity, the IEEE754 utility functions are defined as follows:
namespace IEEE754 {
constexpr inline auto float_to_bits(float f) { return std::bit_cast<uint32_t>(f); }
constexpr inline auto bits_to_float(uint32_t bits) { return std::bit_cast<float>(bits); }
constexpr inline auto get_exponent(float f) { return (float_to_bits(f) >> 23) - ((1 << 7) - 1); }
constexpr inline auto get_significand(float f) { return float_to_bits(f) & ((1 << 23) - 1); }
constexpr inline auto get_sign_bit(float f) { return float_to_bits(f) & 0x80000000; }
};
and evaluate_polynomial(x, c0, c1, c2, c3, ...)
uses Horner's method to evaluate the polynomial with coefficients ci
as c0 + c1(x + c2(x + ...
. Note that I use the fused multiply-add function std::fma
, which is technically only in C++23, but is supported by my compiler.
template <typename T>
constexpr inline auto evaluate_polynomial(float x, T constant_term) {
return constant_term;
}
template <typename T, typename... Ts>
constexpr inline auto evaluate_polynomial(float x, T constant_term, Ts... coefficients) {
return std::fma(x, evaluate_polynomial(x, coefficients...), constant_term);
}
Now, I ran the following simple benchmark with both std::exp
and with fast_exp
.
#include <random>
#include <chrono>
#include <cmath>
int main()
{
/* Generate 10 million random exponents in the range [-120, 120] */
std::mt19937 gen{std::random_device{}()};
std::uniform_real_distribution<float> dist(-120, 120);
std::vector<float> nums(10000000);
for (auto &f : nums) {f = dist(gen);}
/* Call fast_exp, or std::exp, on all those exponents */
auto start = std::chrono::steady_clock::now();
auto sum = 0._f;
for (auto f : nums) {
sum += fast_exp(f); /* Or std::exp(f) */
}
auto end = std::chrono::steady_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms\n";
if (!sum) {return 1;} /* Prevent the above loop from being optimized out */
return 0;
}
Using GCC version 11.4.0 and optimization level -O2
, fast_exp
turns out slower than std::exp
; 60-65ms vs 45ms. This is contrary to the book's claims, as well as many other online comments I've seen that claim this algorithm works faster than the standard library std::exp
. Is there a specific reason why my code is slower here? Have I made a mistake?
UPDATE: I found that removing the two if
-statements reduces the runtime of fast_exp
by a factor of 2 (by 30ms). However, I don't know how to eliminate these branches, or if there is anything else that can be optimized here.
Upvotes: 0
Views: 148