Run-time overhead with boost.units?

Question

I'm seeing some 10% run-time overhead when using a clone of a constexpr-enhanced boost.units with the float value type, using clang and -O3 level optimization. This shows up in some of the more elaborate applications of a library that I've been working on. Given this situation, I have two questions that I'd really like to answer and would love help with:

Boost units is supposed to be a zero-overhead library so why am I seeing the overhead?
More importantly, besides not using boost.units, how can I get the overhead to go away?

Details...

I've been working on an interactive physics engine written in C++14. With the many different physical quantities and units it uses, I love using the compile-time enforced units and quantities that boost.units provides. Unfortunately enabling boost units seems to be coming with this run-time cost. The engine comes with a benchmark application that uses google's benchmark library to provide this insight and it takes some of the more elaborate simulations to see the overhead.

At present, due to the overhead, the engine builds by default without using boost units. By defining the right preprocessor macro name, the engine can be built with boost units. I achieved this switching using code like the following:

// #define USE_BOOST_UNITS
#if defined(USE_BOOST_UNITS)
...
#include 
...
#endif // defined(USE_BOOST_UNITS)

#if defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) boost::units::quantity<BoostDimension, float>
#define UNIT(Quantity, BoostUnit) Quantity{BoostUnit * float{1}}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) Quantity{BoostUnit * float{Ratio}}
#else // defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) float
#define UNIT(Quantity, BoostUnit) float{1}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) float{Ratio}
#endif // defined(USE_BOOST_UNITS)

using Time = QUANTITY(boost::units::si::time);
constexpr auto Second = UNIT(Time, boost::units::si::second);

What I did with the UNIT macro feels a bit suspect to me in that it's taking a boost unit type and turning it into a value. It does make switching between using and not using boost units easier, however, since either way expressions like 3.0f * Second compile without warning. Checking what clang and gcc do with expressions like these appeared to confirm that they were smart enough to avoid multiplying 3.0f * 1.0f at run time and simply treated the expression as 3.0f. I still wonder if that's the cause of the overhead or if it's something else that I've done.
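To make the "unit as a value" idea concrete, here is a minimal sketch of the non-boost branch only. The names Time and Second mirror the aliases above, while Scale is a hypothetical helper added purely for illustration. It only shows that the constant case is resolved at compile time; it says nothing about what happens with the boost types.

```cpp
#include <type_traits>

// Stand-in for the non-boost branch of the macros: a unit is just the value 1.
using Time = float;
constexpr auto Second = float{1};

// With constant operands the product is itself a constant expression,
// so nothing is left to multiply at run time.
constexpr auto ThreeSeconds = 3.0f * Second;
static_assert(std::is_same<decltype(ThreeSeconds), const float>::value,
              "the product stays a float");
static_assert(ThreeSeconds == 3.0f, "evaluated at compile time");

// Hypothetical helper: with a run-time operand the compiler sees t * 1.0f,
// which optimizers can typically reduce to t.
inline Time Scale(float t)
{
    return t * Second;
}
```

Comparing the assembly of Scale against a function that just returns its argument would be one way to double check the run-time case.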

I've also wondered if maybe the problem is rooted in the constexpr enhancement code I'm using or if the author(s) of that code had any idea about this overhead. On searching the internet, I found a mention of overhead with the normal boost units library, so it seems safe to assume the enhanced units are not at fault. A suggestion that came out of my inquiry though (and my thanks go to GitHub user muggenhor for it) was the following:

I expect this is likely caused by the amount of inlining done by the compiler. Because of the wrapper functions for the operators this adds at least one function call that needs to be inlined per operation. For expressions depending on the result of sub-expressions this requires the sub-expressions to be inlined first. As a result I expect the minimum amount of inlining passes to be able to properly optimize your code to be equal to the depth of the produced expression tree...

This sounds like a pretty viable theory to me. Unfortunately, I don't know how to test it and admittedly I'm more fond of digging into my own code at the moment than into clang/LLVM code. I've tried using -inline-threshold=10000 but that doesn't seem to make the overhead go away. To my understanding of clang at least, I don't believe that specifically increases the number of inlining passes. Is there another command line argument that does? Or are there parameters within clang's sources that someone can point me to as a starting point for recompiling clang and trying the modified compiler?
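One way the inlining theory might be probed without touching the compiler is to force inlining of the operator wrappers and see whether the benchmark numbers move. The sketch below uses a hand-rolled wrapper as a stand-in, not boost.units itself; FORCE_INLINE, Wrapped, and Nested are hypothetical names, and the always_inline attribute is a GCC/clang extension. Clang's optimization remarks (-Rpass=inline and -Rpass-missed=inline) may also help show which calls were or were not inlined.

```cpp
// Hypothetical macro name; always_inline is a GCC/clang extension.
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

// Minimal stand-in for a units wrapper; NOT the boost.units quantity type.
struct Wrapped
{
    float value;
};

// Each operator is a thin wrapper function, like the ones the quoted theory
// says must be inlined before the expression can be optimized.
FORCE_INLINE Wrapped operator+(Wrapped a, Wrapped b)
{
    return Wrapped{a.value + b.value};
}

FORCE_INLINE Wrapped operator*(Wrapped a, Wrapped b)
{
    return Wrapped{a.value * b.value};
}

// A nested expression: sub-expressions feed the outer operators, so several
// wrapper calls have to be inlined to reach plain float arithmetic.
inline Wrapped Nested(Wrapped a, Wrapped b, Wrapped c)
{
    return (a * b + c) * (a + b * c);
}
```

If forcing inlining on a wrapper like this changes nothing in the generated code, that would hint that inlining depth is not the limiting factor, at least for this simplified stand-in.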

Another theory I've had is whether using float is the problem. I can rebuild my physics engine to use double instead and compare benchmark results between building with and without the boost units support enabled. What I find when using double is that the overhead at least seems to decrease. I've wondered if maybe boost units is somewhere using double even when I use float in its quantity template and maybe that's causing the overhead.
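A compile-time check along these lines could confirm or rule out silent promotion to double. This is only a sketch with plain arithmetic types, and KeepsType is a hypothetical helper; the same decltype comparison would have to be applied to the actual quantity expressions to say anything about boost units.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical helper: true when multiplying a T by a U yields a T again,
// i.e. the left operand's type survives the operation.
template <typename T, typename U>
constexpr bool KeepsType(T, U)
{
    return std::is_same<decltype(std::declval<T>() * std::declval<U>()), T>::value;
}

static_assert(KeepsType(1.0f, 2.0f), "float * float stays float");
static_assert(!KeepsType(1.0f, 2.0), "float * double promotes to double");
```

A double literal such as 2.0 slipping into an otherwise float expression is the kind of thing this would catch.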

Lastly, I built the boost units performance example with the constexpr enhancements and ran it with both double and float. I got no reliable sign of any overhead, which seems to eliminate my theory of float being the problem.

Update With Data & Code

Got some more isolated data and code on this where it seems I'm seeing significantly more than 10% overhead...

Some benchmark data where Length is basically boost::units::si::length:

Benchmark                                 Time           CPU Iterations
LesserLength/1000                          953 ns        953 ns     724870
LesserFloat/1000                           590 ns        590 ns    1093647
LesserDouble/1000                          619 ns        618 ns    1198938

What the related code looks like:

static void LesserLength(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast(state.range()),
                                -100.0f * playrho::Meter, 100.0f * playrho::Meter);
    auto c = 0.0f * playrho::Meter;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            static_assert(std::is_same::value, "not Length");
            const auto v = (a < b)? a: b;
            benchmark::DoNotOptimize(c = v);
        }
    }
}

static void LesserFloat(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast(state.range()),
                                -100.0f, 100.0f);
    auto c = 0.0f;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same::value, "not float");
            benchmark::DoNotOptimize(c = v);
        }
    }
}

static void LesserDouble(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast(state.range()),
                                -100.0, 100.0);
    auto c = 0.0;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same::value, "not double");
            benchmark::DoNotOptimize(c = v);
        }
    }
}

With this as a hint to me, I checked Godbolt with the following code to see what clang 5.0.0 and gcc 7.2 would generate:

#include 
#include 
#include 

using length = boost::units::quantity<boost::units::si::length, float>;

float f(float a, float b)
{
    return a < b? a: b;
}

length f(length a, length b)
{
    return a < b? a: b;
}

I see that the generated assembly looks quite different between the two functions and between clang and gcc. Here's a gist of the relevant assembly from clang (with the boost stuff here simply shown as length):

f(float, float): # @f(float, float)
  minss xmm0, xmm1
  ret
f(length, length):
  movss xmm0, dword ptr [rdx] # xmm0 = mem[0],zero,zero,zero
  ucomiss xmm0, dword ptr [rsi]
  cmova rdx, rsi
  mov eax, dword ptr [rdx]
  mov dword ptr [rdi], eax
  mov rax, rdi
  ret

Shouldn't both of these compilers, at -O3, be generating the same assembly for the length version as they do for the float version? Is the problem that they're not quite optimizing all the way down to the same code as for float? It seems like this is the problem, and if so that's progress, but I still want to figure out what can be done to really get zero overhead.
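One possibility that would be consistent with the assembly above, where the length arguments are read through pointers and the result is written through rdi, is the calling convention rather than the optimizer. Under the Itanium C++ ABI that clang and gcc follow on Linux, a class type with a non-trivial copy constructor or destructor is passed and returned through memory instead of registers. Whether the quantity type falls into that category is an unverified assumption here; checking std::is_trivially_copyable for length would be a quick test. The sketch below uses two hypothetical wrapper types, not the boost.units types, to show the distinction.

```cpp
#include <type_traits>

// Hypothetical wrapper with implicitly defined copy operations.
struct TrivialLength
{
    float value;
};

// Hypothetical wrapper whose copy constructor is user-provided, even though
// it does exactly what the implicit one would do.
struct UserCopyLength
{
    float value;
    explicit UserCopyLength(float v) : value(v) {}
    UserCopyLength(const UserCopyLength& other) : value(other.value) {}
};

static_assert(std::is_trivially_copyable<TrivialLength>::value,
              "eligible to travel in registers");
static_assert(!std::is_trivially_copyable<UserCopyLength>::value,
              "user-provided copy constructor makes it non-trivial");

// Same logic for both types; comparing their assembly (for example on
// Godbolt) shows how each one is passed and returned.
inline TrivialLength Lesser(TrivialLength a, TrivialLength b)
{
    return a.value < b.value ? a : b;
}

inline UserCopyLength Lesser(UserCopyLength a, UserCopyLength b)
{
    return a.value < b.value ? a : b;
}
```

If the two overloads compile to code resembling the float and length versions above respectively, that would point at copy-constructor triviality as at least part of the difference.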
