I'm seeing roughly 10% run-time overhead when using a clone of a constexpr-enhanced boost.units with the float value type, compiled with clang at the -O3 optimization level. The overhead shows up in some of the more elaborate applications of a library that I've been working on. Given this situation, I have two questions that I'd really like to solve and would love help with:
Details...
I've been working on an interactive physics engine written in C++14. With the many different physical quantities and units it uses, I love using the compile-time enforced units and quantities that boost.units provides. Unfortunately, enabling boost units seems to come with this run-time cost. The engine comes with a benchmark application that uses Google's benchmark library to measure this, and it takes some of the more elaborate simulations to see the overhead.
At present, due to the overhead, the engine builds by default without using boost units. By defining the right preprocessor macro name, the engine can be built with boost units. I achieved this switching using code like the following:
// #define USE_BOOST_UNITS
#if defined(USE_BOOST_UNITS)
...
#include <boost/units/systems/si/time.hpp>
...
#endif // defined(USE_BOOST_UNITS)
#if defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) boost::units::quantity<BoostDimension, float>
#define UNIT(Quantity, BoostUnit) Quantity{BoostUnit * float{1}}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) Quantity{BoostUnit * float{Ratio}}
#else // defined(USE_BOOST_UNITS)
#define QUANTITY(BoostDimension) float
#define UNIT(Quantity, BoostUnit) float{1}
#define DERIVED_UNIT(Quantity, BoostUnit, Ratio) float{Ratio}
#endif // defined(USE_BOOST_UNITS)
using Time = QUANTITY(boost::units::si::time);
constexpr auto Second = UNIT(Time, boost::units::si::second);
What I did with the UNIT macro feels a bit suspect to me in that it takes a boost unit type and turns it into a value. That makes switching between using and not using boost units easier, however, since either way expressions like 3.0f * Second compile without warning. Checking what clang and gcc do with expressions like these appeared to confirm that they're smart enough to avoid multiplying 3.0f * 1.0f at run time and just recognize the expression as 3.0f. I wonder anyway whether that's the cause of the overhead or whether it's something else that I've done.
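One way to check the folding claim at compile time, rather than by reading assembly, is a static_assert. The sketch below does this for the plain float branch of the macros and for a hypothetical stand-in wrapper (ToyQuantity, which is not a boost.units type), assuming C++14 constexpr rules. It only shows that the expression can be a constant expression; it says nothing about what the optimizer does with the same expression in a non-constant context.

```cpp
// Hypothetical stand-in for a unit-carrying type; NOT boost::units::quantity.
struct ToyQuantity {
    float value;
};

// Scalar * quantity, constexpr so it can be evaluated at compile time.
constexpr ToyQuantity operator*(float lhs, ToyQuantity rhs)
{
    return ToyQuantity{lhs * rhs.value};
}

// Mirrors the non-boost branch, where UNIT(...) expands to float{1}.
constexpr float SecondAsFloat = float{1};

// Mirrors the shape of the boost branch, using the toy wrapper instead.
constexpr ToyQuantity SecondAsToy = ToyQuantity{float{1}};

// static_assert needs a constant expression, so these only compile
// if the multiplication can be carried out at compile time.
static_assert(3.0f * SecondAsFloat == 3.0f, "float branch is not constant");
static_assert((3.0f * SecondAsToy).value == 3.0f, "toy branch is not constant");

// Same expression wrapped in a function, for callers at run time.
constexpr float ThreeSeconds()
{
    return (3.0f * SecondAsToy).value;
}
```

The same two static_assert lines could be pointed at the real Second constant in either build configuration, as long as the comparison operators involved are constexpr in the units library being used.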
I've also wondered if maybe the problem is rooted in the constexpr enhancement code I'm using, or if the author(s) of that code had any idea about this overhead. On searching the internet, I found a mention of overhead with the normal boost units library, so it seems safe to assume the enhanced units are not at fault. A suggestion that came out of my inquiring though (and my thanks go to GitHub user muggenhor for it) was the following:
I expect this is likely caused by the amount of inlining done by the compiler. Because of the wrapper functions for the operators this adds at least one function call that needs to be inlined per operation. For expressions depending on the result of sub-expressions this requires the sub-expressions to be inlined first. As a result I expect the minimum amount of inlining passes to be able to properly optimize your code to be equal to the depth of the produced expression tree...
This sounds like a pretty viable theory to me. Unfortunately, I don't know how to test it, and admittedly I'm more fond of digging into my own code at the moment than into clang/LLVM code. I've tried using -inline-threshold=10000 but that doesn't seem to make the overhead go away. To my understanding of clang at least, I don't believe that specifically increases the number of inlining passes. Is there another command line argument that does? Or are there parameters within clang's sources that someone can point me to as a starting point for recompiling clang and trying the modified compiler?
Another theory I've had is whether using float is the problem. I can rebuild my physics engine to use double instead and compare benchmark results between building with and without the boost units support enabled. What I find when using double is that the overhead at least seems to decrease. I've wondered if maybe boost units is somewhere using double even when I use float in its quantity template, and maybe that's causing the overhead.
Lastly, I built boost units' performance example with the constexpr enhancements and ran it with both double and float. I got no reliable sign of any overhead, which seems to eliminate my theory of float being the problem.
Update With Data & Code
Got some more isolated data and code on this where it seems I'm seeing significantly more than 10% overhead...
Some benchmark data where Length is basically boost::units::si::length:
Benchmark                 Time      CPU   Iterations
LesserLength/1000       953 ns   953 ns       724870
LesserFloat/1000        590 ns   590 ns      1093647
LesserDouble/1000       619 ns   618 ns      1198938
What the related code looks like:
static void LesserLength(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0f * playrho::Meter, 100.0f * playrho::Meter);
    auto c = 0.0f * playrho::Meter;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            static_assert(std::is_same<decltype(b), const playrho::Length>::value, "not Length");
            const auto v = (a < b)? a: b;
            benchmark::DoNotOptimize(c = v);
        }
    }
}
static void LesserFloat(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0f, 100.0f);
    auto c = 0.0f;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same<decltype(v), const float>::value, "not float");
            benchmark::DoNotOptimize(c = v);
        }
    }
}
static void LesserDouble(benchmark::State& state)
{
    const auto vals = RandPairs(static_cast<unsigned>(state.range()),
                                -100.0, 100.0);
    auto c = 0.0;
    for (auto _: state)
    {
        for (const auto& val: vals)
        {
            const auto a = std::get<0>(val);
            const auto b = std::get<1>(val);
            const auto v = (a < b)? a: b;
            static_assert(std::is_same<decltype(v), const double>::value, "not double");
            benchmark::DoNotOptimize(c = v);
        }
    }
}
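The RandPairs helper used by these benchmarks isn't shown here. For anyone wanting to reproduce the float and double cases, a hypothetical stand-in could look like the sketch below. The name matches the calls above, but the body, the fixed seed, and the use of std::pair are all assumptions. As written it only accepts float, double, or long double bounds, so the Length case would need the generated values wrapped afterwards.

```cpp
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper: the real RandPairs is not shown in the question.
// Returns `count` pairs of values drawn uniformly between lo and hi.
template <typename T>
std::vector<std::pair<T, T>> RandPairs(unsigned count, T lo, T hi)
{
    std::mt19937 gen(42u); // fixed seed (an assumption) for repeatable runs
    std::uniform_real_distribution<T> dist(lo, hi);
    std::vector<std::pair<T, T>> pairs;
    pairs.reserve(count);
    for (unsigned i = 0; i < count; ++i)
    {
        // Draw in separate statements so the order of the two draws is fixed.
        const T first = dist(gen);
        const T second = dist(gen);
        pairs.emplace_back(first, second);
    }
    return pairs;
}
```

Since std::get<0> and std::get<1> work on std::pair, the benchmark bodies above compile unchanged against this stand-in for the float and double cases.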
With this as a hint to me, I checked Godbolt with the following code to see what clang 5.0.0 and gcc 7.2 would generate:
#include <algorithm>
#include <boost/units/systems/si/length.hpp>
#include <boost/units/cmath.hpp>
using length = boost::units::quantity<boost::units::si::length, float>;
float f(float a, float b)
{
    return a < b? a: b;
}
length f(length a, length b)
{
    return a < b? a: b;
}
I see that the generated assembly looks quite different between the two functions and between clang and gcc. Here's a gist of the relevant assembly from clang (with the boost stuff here simply shown as length):
f(float, float): # @f(float, float)
    minss xmm0, xmm1
    ret
f(length, length)
    movss xmm0, dword ptr [rdx] # xmm0 = mem[0],zero,zero,zero
    ucomiss xmm0, dword ptr [rsi]
    cmova rdx, rsi
    mov eax, dword ptr [rdx]
    mov dword ptr [rdi], eax
    mov rax, rdi
    ret
Shouldn't both of these compilers, using -O3 optimization, be returning the same assembly for the length version as they do for the float version? Is the problem that they're not quite optimizing down all the way to the same code as for float? It seems like this is the problem, and if so that's progress, but I still want to figure out what can be done to really get zero overhead.
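One observation on the clang output: the length version reads its arguments through pointers (rsi, rdx) and writes its result through rdi, which is how the Itanium C++ ABI on x86-64 passes and returns class types that have a non-trivial copy constructor, move constructor, or destructor. A possible explanation, offered here as an assumption rather than something confirmed, is that the quantity type has a user-provided copy constructor. The sketch below uses two hypothetical wrappers (neither is a boost type) to show how that property can be checked, with std::is_trivially_copyable as a close proxy for the ABI rule.

```cpp
#include <type_traits>

// Hypothetical wrapper with a user-provided copy constructor.
struct UserCopyWrapper {
    float value;
    constexpr explicit UserCopyWrapper(float v) : value(v) {}
    constexpr UserCopyWrapper(const UserCopyWrapper& other) : value(other.value) {}
};

// Hypothetical wrapper whose copy constructor is defaulted, hence trivial.
struct DefaultCopyWrapper {
    float value;
    constexpr explicit DefaultCopyWrapper(float v) : value(v) {}
    DefaultCopyWrapper(const DefaultCopyWrapper&) = default;
};

constexpr bool operator<(const UserCopyWrapper& a, const UserCopyWrapper& b)
{
    return a.value < b.value;
}

constexpr bool operator<(const DefaultCopyWrapper& a, const DefaultCopyWrapper& b)
{
    return a.value < b.value;
}

// Same shape as f(length, length) above, once per wrapper.
UserCopyWrapper Lesser(UserCopyWrapper a, UserCopyWrapper b)
{
    return a < b ? a : b;
}

DefaultCopyWrapper Lesser(DefaultCopyWrapper a, DefaultCopyWrapper b)
{
    return a < b ? a : b;
}

// A user-provided copy constructor makes the type non-trivially copyable.
static_assert(!std::is_trivially_copyable<UserCopyWrapper>::value,
              "expected a non-trivial copy");
static_assert(std::is_trivially_copyable<DefaultCopyWrapper>::value,
              "expected a trivial copy");
```

Applying the same static_assert to the length alias from the Godbolt snippet would test the assumption directly, and comparing the -O3 assembly of the two Lesser overloads would show whether the copy constructor alone changes how the arguments are passed.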