Fire Lancer

Reputation: 30145

C++ Small Object Performance

In my program code there are various fairly small objects, ranging from a byte or two up to about 16 bytes, e.g. Vector2 (2 * T), Vector3 (3 * T), Vector4 (4 * T), ColourI32 (4), LightValue16 (2), Tile (2), etc. (byte size in brackets).

I was doing some sample-based profiling, which led me to some slower-than-expected functions, e.g.:

// 4 bits per channel: natural light and artificial RGB
class LightValue16
{
...
    explicit LightValue16(uint16_t value);
    LightValue16(const LightValueF &);
    LightValue16(int r, int g, int b, int natural);

    int natural()const;
    void natural(int v);
    int artificialRed()const;
    ...
    uint16_t data;
};
...
LightValue16 World::getLight(const Vector3I &pos)
{ ... }

This function does some maths to look up the value via a couple of arrays, with some default values for above the populated part of the world. The contents are inlined nicely, and looking at the disassembly it looks about as good as it can get, with about 100 instructions. However, one thing stood out: all the return sites were implemented with something like:

mov eax, dword ptr [ebp + 8]
mov cx, word ptr [ecx + edx * 2] ; or say mov ecx, Fh
mov word ptr [eax], cx
pop ebp
ret 10h

For x64 I saw pretty much the same thing. I didn't check my GCC build, but I suspect it does pretty much the same thing.

I did a little experimenting and found that by using a uint16_t return type, the World::getLight function actually got inlined (it looked like pretty much the same core 80 instructions or so, no cheats with conditionals/loops being different), and the total CPU usage for the outer function I was investigating went from 16.87% to 14.04%. While I can do that on a case-by-case basis (along with trying the force-inline stuff, I suppose), are there any practical ways to avoid such performance issues to start with? Perhaps even get a couple of % faster across the entire code?

The best I can think of just now is to use the primitive types in such cases (objects under 4, or perhaps 8, bytes) and move all the current member functions into non-member functions, so more like it is done in C, just with namespaces.
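A minimal sketch of what that C-style approach might look like. The namespace, function names, and the 4-bit channel layout are all my own assumptions for illustration; the question does not specify the bit packing:

```cpp
#include <cstdint>

// Hypothetical C-style alternative: a plain integer typedef plus free
// functions in a namespace, instead of a wrapper class. Returning and
// passing a bare uint16_t sidesteps any ABI questions about class types.
namespace light {
    using Value = std::uint16_t; // 4 bits per channel: natural, R, G, B (assumed layout)

    // Pack four 4-bit channels into one 16-bit value.
    inline Value make(int r, int g, int b, int natural) {
        return static_cast<Value>((natural << 12) | (r << 8) | (g << 4) | b);
    }
    inline int natural(Value v) { return (v >> 12) & 0xF; }
    inline int artificialRed(Value v) { return (v >> 8) & 0xF; }
    // ...and analogous accessors for the other channels.
}
```

Usage would then be e.g. `light::Value v = light::make(1, 2, 3, 4);` followed by `light::natural(v)`, keeping call sites almost as readable as the member-function version.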

Thinking about this, I guess there is also often a cost to something like "t foo(const Vector3F &p)" over "t foo(float x, float y, float z)"? And if so, over a program extensively using const&, could it add up to a significant difference?
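To make the comparison concrete, here is a sketch of the three calling styles in question. The function names are hypothetical, and the register-passing claim is a general property of common ABIs (e.g. x86-64 System V), not something measured here:

```cpp
// Hypothetical illustration of the three parameter-passing styles.
struct Vector3F { float x, y, z; };

// By const reference: the object must have an address, so a non-inlined
// call typically spills it to the stack and passes a pointer.
float sumByRef(const Vector3F &p) { return p.x + p.y + p.z; }

// By value: a small trivially copyable struct like this can travel in
// registers on common ABIs, much like passing the floats separately.
float sumByValue(Vector3F p) { return p.x + p.y + p.z; }

// Separate scalars: passed in registers on essentially every ABI.
float sumByFloats(float x, float y, float z) { return x + y + z; }
```

Once calls are inlined, all three usually compile to identical code; the difference only shows up at real (non-inlined) call boundaries.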

Upvotes: 5

Views: 1386

Answers (2)

Michael Karcher

Reputation: 4031

Take a look at the Itanium C++ ABI. While your computer definitely has no Itanium processor, gcc models the x86 and x86-64 ABIs very similarly to the Itanium ABI. The linked section states that:

However, if the return value type has a non-trivial copy constructor or destructor, [return into caller-provided memory happens]

To find out what a non-trivial copy constructor or destructor means, take a look at What are Aggregates and PODs and how/why are they special?, and peek at the rules for a class to be "trivially copyable". In your case, the problem is the copy constructor you defined. It should not be needed at all; the compiler will synthesize a copy constructor that just assigns the data member as needed. If you want to explicitly state that you want a copy constructor, and you are using C++11, you can also write it down as a defaulted function, which does not make it non-trivial:

LightValue16(const LightValue16 & other) = default;
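You can verify this at compile time with a static_assert. This is a sketch using a minimal stand-in for the class from the question, not its full definition:

```cpp
#include <cstdint>
#include <type_traits>

// Minimal stand-in for the class in the question: the defaulted copy
// constructor keeps the type trivially copyable, so an Itanium-style
// ABI may return it in a register instead of via caller-provided memory.
struct LightValue16 {
    LightValue16() = default;
    LightValue16(const LightValue16 &other) = default; // defaulted, still trivial
    std::uint16_t data;
};

static_assert(std::is_trivially_copyable<LightValue16>::value,
              "a defaulted copy constructor keeps the type trivially copyable");
```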

Upvotes: 2

eci

Reputation: 2422

In the comments to this question there has already been a lot of discussion about whether the compiler is allowed to handle class LightValue16 as a simple uint16_t for the function you analyzed.

If your class contains no special magic (like virtual functions) and the whole class is visible to the analyzed function, the compiler can produce code which is 100% as efficient as just using a uint16_t.

The problem is "can". Although all decent compilers will usually generate code which is 100% as fast, there will sporadically be situations where some optimization is not applied, or at least the resulting code is different. It might just be that a parameter of a heuristic changes (e.g. inlining is not applied because a little more code remains in some optimization step because of the class), or some optimization pass simply requires a plain numeric type at that stage, which is not even a real bug in the compiler. For example, adding a "template <bool NotUsed>" to your class above will probably change optimization steps within the compiler, although semantically your program does not change.

So, if you want to be 100% sure, use only ints or doubles directly. But 90% of the time it will be 100% as fast, and only 10% of the time will it be merely 90% of the performance, which should be OK for 99% (but not 100%) of all use cases.

Upvotes: 0
