Reputation: 18219
#include <vector>
#include <iostream>
#include <cmath>
#include <cstdlib>   // for rand()
#include <iomanip>
#include <sys/time.h>
using namespace std;
int main()
{
    struct timeval timeStart, timeEnd;

    // Create vectors of random 0s and 1s. We will benchmark the time to sum them up.
    int n1 = 450000000; // size of vector v1
    int n2 = 500000000; // size of vector v2
    int i;
    vector<bool> v1(n1);
    vector<bool> v2(n2);
    for (i = 0; i < n1; i++)
    {
        v1[i] = rand() % 2;
    }
    for (i = 0; i < n2; i++)
    {
        v2[i] = rand() % 2;
    }

    // First technique to sum: sum these vectors with two complete (independent) loops.
    int sum1 = 0;
    int sum2 = 0;
    gettimeofday(&timeStart, NULL);
    for (i = 0; i < n1; i++)
    {
        sum1 += v1[i];
    }
    for (i = 0; i < n2; i++)
    {
        sum2 += v2[i];
    }
    gettimeofday(&timeEnd, NULL);
    cout << "Two complete loops took " << ((timeEnd.tv_sec - timeStart.tv_sec) * 1000000 + timeEnd.tv_usec - timeStart.tv_usec) << " us" << endl;

    // Second technique: sum these vectors with a complete loop and a partial loop.
    sum1 = 0;
    sum2 = 0;
    gettimeofday(&timeStart, NULL);
    for (i = 0; i < n1; i++)
    {
        sum1 += v1[i];
        sum2 += v2[i];
    }
    for (i = n1; i < n2; i++)
    {
        sum2 += v2[i];
    }
    gettimeofday(&timeEnd, NULL);
    cout << "With a reduced second loop, it took " << ((timeEnd.tv_sec - timeStart.tv_sec) * 1000000 + timeEnd.tv_usec - timeStart.tv_usec) << " us" << endl;
    return 0;
}
I systematically get an output of the kind
Two complete loops took 13291126 us
With a reduced second loop, it took 12758827 us
I would have expected either the same time (if the compiler optimized the first solution the way I expected it to) or the two complete loops to take considerably more time (not just 5%-10% longer) than the version with the partial second loop.
What is the compiler most likely doing here? Should I consider using partial loops in the future when looping through two vectors of different lengths?
FYI, I compiled with g++ -std=c++11 -o test test.cpp, and g++ --version reports:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.3.0
Thread model: posix
Upvotes: 0
Views: 67
Reputation: 140168
An attempt to explain the similarity in the execution times:
When you do this:
for (i = 0; i < n1; i++)
{
    sum1 += v1[i];
}
for (i = 0; i < n2; i++)
{
    sum2 += v2[i];
}
you perform 2 loops, hence more instructions, but in both loops you read memory contiguously, so the caches work optimally. (What takes the most time on "modern" computers is memory access and cache misses, more than executing instructions.)
BTW, I doubt that the compiler could fuse those 2 loops into one.
The second case executes fewer control instructions, but the memory isn't read contiguously, since the accesses alternate between the two vectors.
Also: optimizers tend to "unroll" loops, which reduces the negative effect of the control instructions.
So what you gain on one side, you lose on the other. Such optimizations need to be benchmarked, and the variation can be larger or smaller depending on the processor architecture.
Upvotes: 1