Christopher Miller

Reputation: 3451

Why is my implementation of fast natural exponentiation slow?

I am trying to implement a fast natural exponentiation function. I read the corresponding section in the pbrt-v4 book and wrote this function, which computes e^x as 2^(x / ln 2): it takes the integer part i and fractional part f of x / ln 2, approximates 2^f with a polynomial, and then, assuming an IEEE 754 binary32 representation, adds i to the exponent field of the result.

The main function fast_exp is below.

#include <bit>
#include <cmath>
#include <cstdint>
#include <limits>

inline auto fast_exp(float exponent) {
    auto base_two_exponent = exponent * 1.44269504089f;

    auto exp_integer_part = std::floor(base_two_exponent);
    auto exp_fractional_part = base_two_exponent - exp_integer_part;
    
    /* Use polynomial to approximate 2 ^ exp_fractional_part */
    auto two_to_exp_fractional_part = evaluate_polynomial(
        exp_fractional_part,
        1.f, 0.695556856f, 0.226173572f, 0.0781455737f
    );

    int resulting_exponent = IEEE754::get_exponent(two_to_exp_fractional_part)
                                   + static_cast<int>(exp_integer_part);

    /* If exponent is out of range do clamping */
    if (resulting_exponent < -126) { return 0.f; }
    else if (resulting_exponent > 127) { return std::numeric_limits<float>::infinity(); }

    auto bits = IEEE754::float_to_bits(two_to_exp_fractional_part);
    bits &= 0b10000000011111111111111111111111;  /* Zero out exponent bits */
    bits |= (resulting_exponent + (1 << 7) - 1) << 23;  /* Fill in exponent with resulting exponent */
    return IEEE754::bits_to_float(bits);
}

For full clarity, the IEEE754 utility functions are defined as follows:

namespace IEEE754 {
    constexpr inline auto float_to_bits(float f) { return std::bit_cast<uint32_t>(f); }

    constexpr inline auto bits_to_float(uint32_t bits) { return std::bit_cast<float>(bits); }

    constexpr inline auto get_exponent(float f) { return (float_to_bits(f) >> 23) - ((1 << 7) - 1); }

    constexpr inline auto get_significand(float f) { return float_to_bits(f) & ((1 << 23) - 1); }

    constexpr inline auto get_sign_bit(float f) { return float_to_bits(f) & 0x80000000; }
};

and evaluate_polynomial(x, c0, c1, c2, c3, ...) uses Horner's method to evaluate the polynomial with coefficients ci as c0 + x(c1 + x(c2 + x(...))). Note that I call the fused multiply-add function std::fma inside a constexpr function, which is technically only allowed as of C++23, but my compiler accepts it.


template <typename T>
constexpr inline auto evaluate_polynomial(float x, T constant_term) {
    return constant_term;
}

template <typename T, typename... Ts>
constexpr inline auto evaluate_polynomial(float x, T constant_term, Ts... coefficients) {
    return std::fma(x, evaluate_polynomial(x, coefficients...), constant_term);
}

Now, I ran the following simple benchmark with both std::exp and with fast_exp.

#include <chrono>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    /* Generate 10 million random exponents in the range [-120, 120] */
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<float> dist(-120, 120);

    std::vector<float> nums(10000000);
    for (auto &f : nums) {f = dist(gen);}

    /* Call fast_exp, or std::exp, on all those exponents */
    auto start = std::chrono::steady_clock::now();

    auto sum = 0.f;
    for (auto f : nums) {
        sum += fast_exp(f);  /* Or std::exp(f) */
    }

    auto end = std::chrono::steady_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << " ms\n";

    if (!sum) {return 1;}  /* Prevent the above loop from being optimized out */
    return 0;
}

Using GCC 11.4.0 at optimization level -O2, fast_exp turns out slower than std::exp: 60-65 ms versus 45 ms. This contradicts the book's claims, as well as many online comments I've seen claiming this algorithm beats the standard library's std::exp. Is there a specific reason why my code is slower here? Have I made a mistake?


UPDATE: I found that removing the two if-statements roughly halves the runtime of fast_exp (it drops by about 30 ms). However, I don't know how to eliminate these branches, or whether there is anything else here that can be optimized.

Upvotes: 0

Views: 148

Answers (0)
