Sup3rlum

Reputation: 51

Performance difference between accessing the member of a heap and a stack object?

Currently I'm using the '->' operator to access members through a pointer to a class object. My question is whether it is faster than normal member access. For example:

Class* myClsPtr = new Class();
myClsPtr->foo(bar);

Vs.

Class myCls;
myCls.foo(bar);

Can I use both ways without a performance difference?

Upvotes: 1

Views: 265

Answers (4)

Peter - Reinstate Monica

Reputation: 16016

I found the results puzzling, so I investigated a little further. First I enhanced the example program by using chrono and adding one test which accesses the local variable (instead of memory on the heap) through a pointer. That made sure that any timing difference was caused by the access method, not by the location of the object.

Second, I added a dummy member to the struct because I noticed that the direct member store used an offset from the stack pointer, which I suspected could be the culprit; the pointer version accessed the memory through a register without an offset. The dummy leveled the field there, but it didn't make a difference.

Access through a pointer was significantly faster for both the heap and the local object. Here's the source:

#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

struct MyStruct { /* offset for i */ int dummy; int i; };

int main()
{
    MyStruct *heapPtr = new MyStruct;
    MyStruct localObj;
    MyStruct *localPtr = &localObj;

    ///////////// ptr to heap /////////////////////
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < 100000000; ++i)
    {
        heapPtr->i = i;
    }
    auto t2 = high_resolution_clock::now();
    cout << "heap ptr: " 
        << duration_cast<milliseconds>(t2-t1).count() 
        << " ms" << endl;

    ////////////////// local obj ///////////////////////
    t1 = high_resolution_clock::now();
    for (int i = 0; i < 100000000; ++i)
    {
        localObj.i = i;
    }
    t2 = high_resolution_clock::now();
    cout << "local: " 
        << duration_cast<milliseconds>(t2-t1).count() 
        << " ms" << endl;

    ////////////// ptr to local /////////////////
    t1 = high_resolution_clock::now();
    for (int i = 0; i < 100000000; ++i)
    {
        localPtr->i = i;
    }
    t2 = high_resolution_clock::now();
    cout << "ptr to local: " 
        << duration_cast<milliseconds>(t2-t1).count() 
        << " ms" << endl;

    /////////// have a side effect ///////////////
    return heapPtr->i + localObj.i;
}

Here is a typical run. Differences between heap and local ptr are random in both directions.

heap ptr: 217 ms
local: 236 ms
ptr to local: 206 ms

Here is the disassembly of the pointer access and the direct access. I assume that heapPtr's stack offset is 0x38, so the first mov moves its contents, i.e. the address of the heap object it points to, into %rax. This serves as the destination address in the third mov (with a 4-byte offset due to the preceding dummy member).

The second move gets i's value (i is apparently at stack offset 4C, which lines up if you count all the intervening definitions) into %edx (because the last mov can have at most one memory operand, which is the object, so the value in i must go into a register).

The last mov gets i's value, in register %edx, into the object's address, now in %rax, plus an offset of 4 because of the dummy.

                heapPtr->i = i;
  3e:   48 8b 45 38             mov    0x38(%rbp),%rax
  42:   8b 55 4c                mov    0x4c(%rbp),%edx
  45:   89 50 04                mov    %edx,0x4(%rax)

As was to be expected, the direct access is shorter. The variable's value (a different local i, this time at stack offset 0x48) is loaded into register %eax, which is then written into the address at stack offset -0x60 (I don't know why some local objects are stored at positive offsets and others at negative ones). The bottom line is that this is one instruction shorter than the pointer access; basically, the first instruction of the pointer access, which loads the pointer's value into an address register, is missing. That is exactly what we would expect -- that's the dereferencing. Nonetheless the direct access takes more time. I have no idea why. Since I excluded most other possibilities, I must assume that either using %rbp is slower than using %rax (unlikely) or that a negative offset slows the access down. Is that so?

                localObj.i = i;
  d6:   8b 45 48                mov    0x48(%rbp),%eax
  d9:   89 45 a0                mov    %eax,-0x60(%rbp)

It should be noted that gcc moves the assignment out of the loop when optimization is turned on. So this is in a way a phantom problem for people concerned about performance. Additionally these small differences will be drowned out by anything "real" happening in the loops. But it is still unexpected.

Upvotes: 0

Peter - Reinstate Monica

Reputation: 16016

Since a->b is equivalent to (*a).b (and that's indeed what the compiler must create, at least logically), -> could only be slower than ., if there is any difference at all. In practice the compiler will likely keep a's address in a register and add b's offset immediately, skipping the explicit (*a) step and effectively reducing the access to the same code as a.b internally.
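A minimal sketch of that equivalence (my illustration, not part of the original claim): both functions below access the same member, and with optimization enabled a typical compiler emits identical code for them.

struct S { int b; };

// Two spellings of the same access; with optimization both typically compile
// to a single load from the pointer's register plus b's offset.
int via_arrow(S* a) { return a->b; }
int via_deref(S* a) { return (*a).b; }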

With -O3, gcc 4.8.2 eliminates the whole loop, by the way. It even does that if we return the last MyStruct::i from main -- the loop is side-effect free and the end value is trivially computable. Just another benchmarking remark.
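If you do want to time the raw stores under -O3, one common workaround (my addition, not from the answer above) is to make the destination volatile, so the compiler has to perform every store:

#include <chrono>
#include <iostream>

struct MyStruct { int i; };

int main()
{
    MyStruct obj;
    volatile int* sink = &obj.i;   // volatile forces each store to be emitted

    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 100000000; ++i)
        *sink = i;                 // cannot be hoisted out of the loop or removed
    auto t2 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms" << std::endl;
    return obj.i;
}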

And then it's not about the object being on the heap but it's about using an address vs. using an object right away. The logic would be the same for the same object:

MyStruct m;
MyStruct *mp = &m;

and then run your two loops, with m or mp respectively. The position of an object (in terms of which memory page it is on) may matter a lot more than whether you access it directly or via a pointer, because locality tends to be important on modern architectures (with caches and parallelism). If some memory is already in a cached location (the stack may well be cached), it is much faster to access than a location which must be loaded into the cache first (some arbitrary heap location). In either loop the memory where the object resides will likely stay cached because not much else happens there, but in more realistic scenarios (iterating over pointers in a vector: where do the pointers point? Scattered or contiguous memory?) these considerations will far outweigh the cheap dereferencing, as the sketch below illustrates.
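A rough sketch of that last point (my illustration; the sum_contiguous/sum_scattered helpers are made up for the example): iterating over objects stored contiguously in a vector versus chasing pointers to individually heap-allocated objects. The member access is the same in both loops; the difference in locality is what dominates.

#include <memory>
#include <vector>

struct MyStruct { int i; };

long long sum_contiguous(const std::vector<MyStruct>& v)
{
    long long s = 0;
    for (const MyStruct& m : v)   // objects laid out back to back: cache friendly
        s += m.i;
    return s;
}

long long sum_scattered(const std::vector<std::unique_ptr<MyStruct>>& v)
{
    long long s = 0;
    for (const auto& p : v)       // each element lives somewhere else on the heap:
        s += p->i;                // every access is a potential cache miss
    return s;
}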

Upvotes: 0

Adrian Ratnapala

Reputation: 5693

As with so many performance questions, the answer is complicated and variable. The potential sources of slowness using the heap are:

  • Time to allocate and deallocate objects.
  • The possibility that the object is not in the cache.

Both of these mean an object on the heap might be slow at first. But this won't matter much if you use the object many times in a tight loop: the object will soon end up in the CPU cache whether it lives on the heap or on the stack.

A related issue is whether objects that contain other objects should hold pointers or copies. If speed is the only concern, it is probably better to store copies, because each extra pointer lookup is a potential cache miss.
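A small illustration of that trade-off (my sketch; the Engine/Car names are made up for the example): the by-value member sits inside the containing object, while the pointer member adds an extra indirection that can miss the cache.

#include <memory>

struct Engine { int power; };

struct CarByValue {
    Engine engine;                   // stored inline: same allocation, likely same cache line
};

struct CarByPointer {
    std::unique_ptr<Engine> engine;  // stored elsewhere: extra lookup, possible cache miss
};

int powerOf(const CarByValue& c)   { return c.engine.power; }
int powerOf(const CarByPointer& c) { return c.engine->power; }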

Upvotes: 2

László Papp

Reputation: 53155

First,

Class myCls = new Class();

is invalid code... Let us assume you meant

Class myCls;

There will be pretty much no noticeable difference, but you could benchmark it yourself by iterating a million times in a loop, calling either variant, and timing the execution of both.

I have just made a quick and dirty benchmark on my laptop with one hundred million iterations, as follows:

Stack Object

struct MyStruct
{
    int i;
};

int main()
{
    MyStruct stackObject;

    for (int i = 0; i < 100000000; ++i)
        stackObject.i = 0;

    return 0;
}

and then I ran:

g++ main.cpp && time ./a.out

the result is:

real    0m0.301s
user    0m0.303s
sys 0m0.000s

Heap Object

struct MyStruct
{
    int i;
};

int main()
{
    MyStruct *heapObject = new MyStruct();

    for (int i = 0; i < 100000000; ++i)
        heapObject->i = 5;

    return 0;
}

and then I ran:

g++ main.cpp && time ./a.out

the result is:

real    0m0.253s
user    0m0.250s
sys 0m0.000s

As you can see, the heap object is slightly faster on my machine for 100 million iterations. Even on my machine, the difference would be unnoticeable for significantly fewer iterations. One thing that stands out is that, although the results differ slightly between runs, the heap object version always performs better on my laptop. Do not take that as a guarantee, however.

Upvotes: 4
