c++ loop unroll performance

Question

I was reading the metaprogramming part of the book 'C++ Templates: The Complete Guide', which has an example of loop unrolling (17.7). I've implemented the program for dot product calculations:

#include <iostream>
#include <sys/time.h>

using namespace std;

template <int DIM, typename T>
struct Functor
{
    static T dot_product(T *a, T *b)
    {
        return *a * *b + Functor<DIM - 1, T>::dot_product(a + 1, b + 1);
    }
};

template <typename T>
struct Functor<1, T>
{
    static T dot_product(T *a, T *b)
    {
        return *a * *b;
    }
};


template <int DIM, typename T>
T dot_product(T *a, T *b)
{
    return Functor<DIM, T>::dot_product(a, b);
}

double dot_product(int DIM, double *a, double *b)
{
    double res = 0;
    for (int i = 0; i < DIM; ++i)
    {
        res += a[i] * b[i];
    }
    return res;
}


int main(int argc, const char * argv[])
{
    static const int DIM = 100;

    double a[DIM];
    double b[DIM];

    for (int i = 0; i < DIM; ++i)
    {
        a[i] = i;
        b[i] = i;
    }


    {
        timeval startTime;
        gettimeofday(&startTime, 0);

        for (int i = 0; i < 100000; ++i)
        {
            double res = dot_product<DIM>(a, b);
            //double res = dot_product(DIM, a, b);
        }

        timeval endTime;
        gettimeofday(&endTime, 0);

        double tS = startTime.tv_sec * 1000000 + startTime.tv_usec;
        double tE = endTime.tv_sec   * 1000000 + endTime.tv_usec;

        cout << "template time: " << tE - tS << endl;
    }

    {
        timeval startTime;
        gettimeofday(&startTime, 0);

        for (int i = 0; i < 100000; ++i)
        {
            double res = dot_product(DIM, a, b);
        }

        timeval endTime;
        gettimeofday(&endTime, 0);

        double tS = startTime.tv_sec * 1000000 + startTime.tv_usec;
        double tE = endTime.tv_sec   * 1000000 + endTime.tv_usec;

        cout << "loop time: " << tE - tS << endl;
    }

    return 0;
}

I'm using Xcode with all code optimisations turned off. According to the book, I expected the template version to be faster than the simple loop. But the results are (t = template, l = loop; times in microseconds, as printed by the program):

DIM 5: t = ~5000, l = ~3500

DIM 50: t = ~55000, l = 16000

DIM 100: t = 130000, l = 36000

I've also tried making the template functions inline, with no performance difference.

Why is the simple loop so much faster?

Charles Salvia · Accepted Answer

Depending on the compiler, if you don't turn on performance optimizations, loop unrolling might not occur.

It's pretty easy to understand why: your recursive template instantiations are basically creating a series of functions. The compiler can't turn all of that into an inlined, unrolled loop and still keep sensible debugging information available. Suppose a segfault happens somewhere inside one of your functions, or an exception is thrown. Wouldn't you want to be able to get a stack trace that showed each frame? The compiler thinks you might want that, unless you turn on optimizations, which gives your compiler permission to go to town on your code.
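One way to check this is to build the same comparison with and without optimizations. The following is a minimal sketch, not the book's code: `Unrolled`, `loop_dot_product` and `time_microseconds` are hypothetical names chosen for illustration, and `std::chrono` is swapped in for `gettimeofday`. The result is written to a `volatile` variable because, once optimizations are on, a compiler may otherwise discard a computation whose result is never used.

```cpp
#include <chrono>

// Recursive case: one multiply-add, then recurse on the remaining DIM - 1 elements.
// Without inlining, each level is a real function call.
template <int DIM, typename T>
struct Unrolled
{
    static T dot_product(const T *a, const T *b)
    {
        return *a * *b + Unrolled<DIM - 1, T>::dot_product(a + 1, b + 1);
    }
};

// Base case: ends the recursion at a single element.
template <typename T>
struct Unrolled<1, T>
{
    static T dot_product(const T *a, const T *b)
    {
        return *a * *b;
    }
};

// Plain run-time loop for comparison.
inline double loop_dot_product(int dim, const double *a, const double *b)
{
    double res = 0;
    for (int i = 0; i < dim; ++i)
    {
        res += a[i] * b[i];
    }
    return res;
}

// Hypothetical helper: calls f() `iterations` times and returns the elapsed
// time in microseconds. Writing to a volatile sink keeps the result
// observable, so the optimizer cannot discard the computation entirely.
template <typename F>
long long time_microseconds(F f, int iterations)
{
    volatile double sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
    {
        sink = sink + f();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

For example, `time_microseconds([&] { return Unrolled<100, double>::dot_product(a, b); }, 100000)` can be compared against the same call using `loop_dot_product(100, a, b)`, once in a build with `-O0` and once with `-O2` (GCC/Clang flags). The numbers will depend on the compiler and machine, so treat them as a rough indication only.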
