Reputation: 1518
After reading this blog post about how unfriendly a list is to cache: http://www.baptiste-wicht.com/2012/11/cpp-benchmark-vector-vs-list/
... I tried to make a std::list of pointers to objects more cache friendly by putting the actual object into each node (thereby removing one indirection operation) in the hope that when the current node is cached, the object will be too. However, performance actually decreased. Here's the code I used:
Source and binaries: http://wilcobrouwer.nl/bestanden/ListTest%202013-8-15%20%233.7z
#include <list>
using std::list;

class Object {
public:
    Object(char i);
    ~Object();
    char dump[256];
};

list<Object*> case1;
list<Object> case2;
// Should not notice much of a difference here, equal amounts of memory are
// allocated
// (Test and clock are helpers from the full source linked above)
void Insertion(Test* test) {
    // create object, copy pointer
    float start1 = clock->GetTimeSec();
    for (int i = 0; i < test->size; i++) {
        case1.push_back(new Object(i));
    }
    test->insertion1 = clock->GetTimeSec() - start1;
    // create object in place, no temps on stack
    float start2 = clock->GetTimeSec();
    for (int i = 0; i < test->size; i++) {
        case2.emplace_back(i);
    }
    test->insertion2 = clock->GetTimeSec() - start2;
}
// Case 2 removes one extra layer of dereference, so it should be more cache
// friendly, because when the list node is found in cache, the object should be
// there too
void Iteration(Test* test) {
    // faster than case2 for some reason
    float start1 = clock->GetTimeSec();
    int tmp1 = 0;
    for (list<Object*>::iterator i = case1.begin(); i != case1.end(); i++) {
        tmp1 += (**i).dump[128];
    }
    test->iteration1 = clock->GetTimeSec() - start1;
    // why the hell is this slower? I removed a dereference
    float start2 = clock->GetTimeSec();
    int tmp2 = 0;
    for (list<Object>::iterator i = case2.begin(); i != case2.end(); i++) {
        tmp2 += (*i).dump[128]; // is equal to tmp1, so no mistakes...
    }
    test->iteration2 = clock->GetTimeSec() - start2;
}
// Case 2 removes one extra layer of dereference, so it should be more cache
// friendly, because when the list node is found in cache, the object should be
// there too
void Deletion(Test* test) {
    // again, faster than case2 for some reason
    float start1 = clock->GetTimeSec();
    int size1 = case1.size();
    for (list<Object*>::iterator i = case1.begin(); i != case1.end(); i++) {
        delete *i;
    }
    case1.clear();
    test->deletion1 = clock->GetTimeSec() - start1;
    // as before: why is this slower? I removed a dereference
    float start2 = clock->GetTimeSec();
    int size2 = case2.size();
    case2.clear();
    test->deletion2 = clock->GetTimeSec() - start2;
}
These functions are run for test->size values linearly varying from 1 to 100000, and time differences between clock->GetTimeSec() are saved to disk after calculations are finished. A plot of my results can be found here:
http://wilcobrouwer.nl/bestanden/ListTestFix.png
As you can see, case 2 is about 10% faster at inserting and deleting, but about 10% slower at iterating, which means the extra dereference needed for iterating case 1 makes it faster!
What am I missing here?
Edit 1: my CPU is a Phenom II X4 @ 3.5GHz (constant frequency) with 64K/1MB/6MB of cache, and I'm compiling this way (please note that -m64 is implied, which implies a ban on x87 via -mfpmath=sse):
Compiler: TDM-GCC 4.7.1 64-bit Release
rm -f obj/Clock.o obj/main.o obj/Object.o ListTest.exe
g++.exe -c Clock.cpp -o obj/Clock.o -std=gnu++11
g++.exe -c main.cpp -o obj/main.o -std=gnu++11
g++.exe -c Object.cpp -o obj/Object.o -std=gnu++11
g++.exe obj/Clock.o obj/main.o obj/Object.o -o ListTest.exe -static-libgcc
Edit 2: answer to Dale Wilson: with list I mean a std::list. Answer to Mats Petersson: a summary has been added to the picture. Optimization checks are under way. Answer to the guy who asked about bigger data sets: sorry, I've only got 4GiB of RAM, and the plots from the current maximum up to filling that are quite boring.
Edit 3: I've enabled -O3 (-O2 produces similar results), which only made stuff worse:
http://wilcobrouwer.nl/bestanden/ListTestO3Fix.png
This time, case 2 is about 20% faster at inserting and deleting, but about 1 to 5 times slower at iterating (worse at larger test sizes). Same conclusion.
Edit 4: answer to Maxim Yegorushkin: CPU frequency scaling happens to be disabled (forgot to mention), my CPU always runs at 3.5GHz. Also, picking averages or best results out of multiple runs is effectively done too, because there are more than enough sample points on the x axis. Optimization is enabled as well: -O3, -m64 and -mfpmath=sse are all set. Adding identical std::vector tests after each other (check the source) did not change anything significant.
Edit 5: fixed a few typos (deletion results were not shown, and iteration results were shown twice). This has cleared up the deletion problem, but the iteration problem remains.
Upvotes: 5
Views: 780
Reputation: 18449
Pure speculation: the list of objects may actually be less cache friendly. The memory allocator probably has to put the node+object structure into a 512-byte slot, with most of it empty, because it's 256 bytes plus whatever list node overhead exists. By comparison, the list of pointers can put the objects in contiguous 256-byte slots and the nodes in (for example) contiguous 16-byte slots: two separate regions of memory, but both packed densely.
Test case: try reducing that array to 220 in size (so the node plus object fits within a 256-byte slot).
Upvotes: 0
Reputation:
My tests show that storing objects is a bit faster than storing pointers. If the amount of objects/pointers is too high, memory management gets into trouble (swapping).
The source I use:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <list>
using std::list;
using namespace std::chrono;
struct Test {
    int size = 1000000;
    duration<double> insertion1;
    duration<double> insertion2;
    duration<double> iteration1;
    duration<double> iteration2;
    duration<double> deletion1;
    duration<double> deletion2;
};

class Object {
public:
    Object(char i);
    ~Object();
    char dump[256];
};

Object::Object(char i) { std::fill_n(dump, 256, i); }
Object::~Object() {}

list<Object*> case1;
list<Object> case2;
// Should not notice much of a difference here, equal amounts of memory are
// allocated
void Insertion(Test& test, int order) {
    for (int n = 0; n < 2; ++n) {
        // create object, copy pointer
        if ((n == 0 && order == 0) || (n == 1 && order == 1)) {
            high_resolution_clock::time_point start1 = high_resolution_clock::now();
            for (int i = 0; i < test.size; i++) {
                case1.push_back(new Object(i));
            }
            test.insertion1 = duration_cast<duration<double>>(high_resolution_clock::now() - start1);
        }
        // create object in place, no temps on stack
        if ((n == 0 && order != 0) || (n == 1 && order != 1)) {
            high_resolution_clock::time_point start2 = high_resolution_clock::now();
            for (int i = 0; i < test.size; i++) {
                case2.emplace_back(i);
            }
            test.insertion2 = duration_cast<duration<double>>(high_resolution_clock::now() - start2);
        }
    }
}
// Case 2 removes one extra layer of dereference, so it should be more cache
// friendly, because when the list node is found in cache, the object should be
// there too
void Iteration(Test& test, int order) {
    for (int n = 0; n < 2; ++n) {
        // faster than case2 for some reason
        if ((n == 0 && order == 0) || (n == 1 && order == 1)) {
            high_resolution_clock::time_point start1 = high_resolution_clock::now();
            int tmp1 = 0;
            for (list<Object*>::iterator i = case1.begin(); i != case1.end(); i++) {
                tmp1 += (**i).dump[128];
            }
            test.iteration1 = duration_cast<duration<double>>(high_resolution_clock::now() - start1);
        }
        // why the hell is this slower? I removed a dereference
        if ((n == 0 && order != 0) || (n == 1 && order != 1)) {
            high_resolution_clock::time_point start2 = high_resolution_clock::now();
            int tmp2 = 0;
            for (list<Object>::iterator i = case2.begin(); i != case2.end(); i++) {
                tmp2 += (*i).dump[128]; // is equal to tmp1, so no mistakes...
            }
            test.iteration2 = duration_cast<duration<double>>(high_resolution_clock::now() - start2);
        }
    }
}
// Case 2 removes one extra layer of dereference, so it should be more cache
// friendly, because when the list node is found in cache, the object should be
// there too
void Deletion(Test& test, int order) {
    for (int n = 0; n < 2; ++n) {
        // again, faster than case2 for some reason
        if ((n == 0 && order == 0) || (n == 1 && order == 1)) {
            high_resolution_clock::time_point start1 = high_resolution_clock::now();
            int size1 = case1.size();
            for (list<Object*>::iterator i = case1.begin(); i != case1.end(); i++) {
                delete *i;
            }
            case1.clear();
            test.deletion1 = duration_cast<duration<double>>(high_resolution_clock::now() - start1);
        }
        // as before: why is this slower? I removed a dereference
        if ((n == 0 && order != 0) || (n == 1 && order != 1)) {
            high_resolution_clock::time_point start2 = high_resolution_clock::now();
            int size2 = case2.size();
            case2.clear();
            test.deletion2 = duration_cast<duration<double>>(high_resolution_clock::now() - start2);
        }
    }
}
int main() {
    Test test;
    std::cout
        << "First Test:\n"
           "==========" << std::endl;
    Insertion(test, 0);
    std::cout
        << "Insertion [Ptr] " << test.insertion1.count()
        << "\n [Obj] " << test.insertion2.count() << std::endl;
    Iteration(test, 0);
    std::cout
        << "Iteration [Ptr] " << test.iteration1.count()
        << "\n [Obj] " << test.iteration2.count() << std::endl;
    Deletion(test, 0);
    std::cout
        << "Deletion [Ptr] " << test.deletion1.count()
        << "\n [Obj] " << test.deletion2.count() << std::endl;
    std::cout
        << "Second Test:\n"
           "===========" << std::endl;
    Insertion(test, 1);
    std::cout
        << "Insertion [Ptr] " << test.insertion1.count()
        << "\n [Obj] " << test.insertion2.count() << std::endl;
    Iteration(test, 1);
    std::cout
        << "Iteration [Ptr] " << test.iteration1.count()
        << "\n [Obj] " << test.iteration2.count() << std::endl;
    Deletion(test, 1);
    std::cout
        << "Deletion [Ptr] " << test.deletion1.count()
        << "\n [Obj] " << test.deletion2.count() << std::endl;
    return 0;
}
Output:
First Test:
==========
Insertion [Ptr] 0.298454
[Obj] 0.253187
Iteration [Ptr] 0.041983
[Obj] 0.038143
Deletion [Ptr] 0.154887
[Obj] 0.187797
Second Test:
===========
Insertion [Ptr] 0.291386
[Obj] 0.268011
Iteration [Ptr] 0.039379
[Obj] 0.039853
Deletion [Ptr] 0.150818
[Obj] 0.105357
Please note that in the deletion test, whichever list is deleted first is faster than the second. The issue seems to lie in memory management.
Upvotes: 0
Reputation: 136266
A bit off-topic, but such a benchmarking methodology does not yield correct, repeatable results, because it ignores cache effects, CPU frequency scaling and the process scheduler.
To measure times correctly, you need to run each micro-benchmark (i.e. each and every loop) a few times (say, at least 3) and pick the best time. That best time is the best achievable time, attained when the CPU cache, TLB and branch predictor are hot. You need the best times because the worst times have no upper bound, so they cannot be meaningfully compared.
When benchmarking you also need to disable CPU frequency scaling, so that it does not switch frequencies in the middle of your benchmark. It also ought to be run with a real-time priority to reduce scheduling noise resulting from other processes pre-empting your benchmark.
And don't forget to compile it with optimization.
Next, let's review your benchmarks.
Your insertion benchmark does two memory allocations per element (list<Object*>) vs. one memory allocation per element (list<Object>), so the two cases are not measuring the same work.
What you really want to measure is iteration over a list vs. iteration over an array while reading all bytes of an object (e.g. summing all bytes of an object). Your hypothesis is that when objects are laid out in an array and accessed sequentially, the CPU pre-loads the next object into the cache, so that when you get to access it you don't incur a cache miss. Whereas when objects are stored in a list whose nodes are not contiguous in memory, that cache read-ahead does not improve speed, because the next object is not adjacent in memory to the current one; when the loop chases the list's pointer it incurs a cache miss.
Upvotes: 3
Reputation: 18472
Usually, when you assign
Object left = right;
that is equivalent to:
- allocating storage for left (usually on the stack if it's a local variable), and
- calling the copy constructor, Object::Object(const Object& right). If not declared, copy constructors are generated implicitly by the compiler.
So that's a little more code to execute than any of the following:
Object& left = right;
const Object& left = right;
Object* pLeft = &right;
Each of these constructs just creates a reference or pointer, not a new object.
However, in your case you use list<Object>::iterator, which I think is essentially a pointer, so this does not explain the speed discrepancy.
Upvotes: 0
Reputation: 15872
In terms of insertion, case 1 is slower because it allocates memory twice (once for the object, and again for the list node that stores the pointer to it). Since case 2 allocates only once per insert, it will be faster.
The list container is, in general, not cache friendly. There is no guarantee that sequential nodes will be in sequential memory blocks, so when iterating, a list of pointers will be faster since its smaller nodes are more likely to sit in sequential blocks than the larger nodes of a list of objects. The same holds for deleting the entire list (since that iterates the list again).
If you wanted to be more cache-friendly, use a vector (but then insertions and deletions in the middle will be more expensive).
Upvotes: 1
Reputation: 96241
I don't see any optimization settings in your build commands, so presumably you're getting an unoptimized build. It's entirely believable that in such a build the extra level of indirection (and/or the fact that the list nodes are smaller) improves performance by chance or by a quirk of the library implementation.
Try compiling with at least -O2 enabled and see what happens.
Upvotes: 2