Annamalai

Reputation: 25

Optimizing a nested for loop in C

I have the following piece of C code:

double findIntraClustSimFullCoverage(cluster * pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;

    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;

        for (j = 0; j < 10000; j++)
        {

            perElemSimilarity += arr[i][j];

        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}

NOTE: arr is a matrix of size 10000 x 10000

This is a portion of a GA (genetic algorithm) code, so this nested for loop runs many times and dominates the program's running time, i.e. it takes a very long time to produce results. I profiled the code using valgrind / kcachegrind, which indicated that 70% of the process execution time was spent in this nested for loop. The register variables i and j do not actually seem to be kept in registers (profiling with and without the register keyword showed no difference).

I simply cannot find a way to optimize this nested for loop (it is very simple and straightforward). Please help me optimize this portion of the code.

Upvotes: 0

Views: 3660

Answers (5)

gnasher729

Reputation: 52538

The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values, which is 800 MB of data; whatever you do is limited by the time it takes to read those 800 MB from memory.

On the other hand, are you also writing 10,000 x 10,000 values each time this is called? If not, you could for example store the sum of each row, along with a boolean per row indicating that its sum needs to be recalculated; set the flag each time you change an element of that row. Or you could even update a row's sum directly each time one of its elements is changed.
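A sketch of that row-caching idea (set_cell and total_sum are hypothetical names, and N stands in for 10000):

```c
#include <stdbool.h>

#define N 4   /* 10000 in the real code; small here for illustration */

double arr[N][N];
double row_sum[N];   /* cached sum of each row */
bool   row_dirty[N]; /* true if the cached sum is stale */

/* hypothetical setter: the only way code should write to arr */
void set_cell(int i, int j, double v)
{
    arr[i][j] = v;
    row_dirty[i] = true;  /* this row's cached sum must be recomputed */
}

double total_sum(void)
{
    double sum = 0;
    for (int i = 0; i < N; i++) {
        if (row_dirty[i]) {  /* recompute only the rows that changed */
            double s = 0;
            for (int j = 0; j < N; j++)
                s += arr[i][j];
            row_sum[i] = s;
            row_dirty[i] = false;
        }
        sum += row_sum[i];
    }
    return sum;
}
```

If only a few rows change between calls, total_sum then touches only those rows plus the small row_sum array, instead of all 800 MB.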

Upvotes: 0

Ambroz Bizjak

Reputation: 8095

I'm assuming that you change the arr matrix frequently; otherwise you could just compute the sum once (see Luchian's answer) and remember it.

You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has changed, you can store a 'sum' value somewhere, and have every piece of code that updates the matrix also update the stored sum appropriately. For instance, assuming you start with an array of all zeros:

double arr[10000][10000];
< initialize it to all zeros >
double sum = 0;

// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];

// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];

You might want to completely re-calculate the sum from time to time to avoid accumulation of floating-point error.
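To make sure no caller forgets the update, the discipline above can be wrapped in a single setter function (set_elem and the small size here are illustrative, not from the question):

```c
#define N 100  /* 10000 in the real code */

double arr[N][N];
double sum = 0;  /* running total of all elements, kept in sync below */

/* hypothetical setter: the only way code should write to arr */
void set_elem(int i, int j, double v)
{
    sum -= arr[i][j];  /* remove the old value from the total */
    arr[i][j] = v;
    sum += v;          /* add the new one */
}
```

With this in place, reading the total is O(1) regardless of how often the matrix changes.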

Upvotes: 1

rotoglup

Reputation: 5238

If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.

It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way; that is to say, (char *)&arr[i][j+1] - (char *)&arr[i][j] == sizeof(double). (Subtracting the double pointers directly would yield 1, the distance in elements, not in bytes.)
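For a plain double arr[10000][10000] this contiguity is guaranteed by the language; it only becomes a concern if the matrix is built as an array of separately allocated row pointers. A minimal check, with a small N standing in for 10000:

```c
#include <stddef.h>

#define N 4  /* stands in for 10000 */

double arr[N][N];  /* a true 2-D array: all rows contiguous in memory */

/* returns 1 if the layout is linear and cache friendly */
int rows_are_contiguous(void)
{
    /* adjacent elements within a row are sizeof(double) bytes apart */
    if ((char *)&arr[0][1] - (char *)&arr[0][0] != (ptrdiff_t)sizeof(double))
        return 0;
    /* the first element of row 1 follows the last element of row 0 */
    if ((char *)&arr[1][0] - (char *)&arr[0][N - 1] != (ptrdiff_t)sizeof(double))
        return 0;
    return 1;
}
```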

You may also try to unroll your inner loop, in case your compiler does not already do it. Your code:

    for (j = 0; j < 10000; j++)
    {
        perElemSimilarity += arr[i][j];
    }

would, for example, become:

    for (j = 0; j < 10000; j+=10)
    {
        perElemSimilarity += arr[i][j+0];
        perElemSimilarity += arr[i][j+1];
        perElemSimilarity += arr[i][j+2];
        perElemSimilarity += arr[i][j+3];
        perElemSimilarity += arr[i][j+4];
        perElemSimilarity += arr[i][j+5];
        perElemSimilarity += arr[i][j+6];
        perElemSimilarity += arr[i][j+7];
        perElemSimilarity += arr[i][j+8];
        perElemSimilarity += arr[i][j+9];
    }
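One refinement of this unrolling, not shown above: all ten additions feed the same perElemSimilarity variable, so they still form one serial dependency chain. Splitting the work across several independent accumulators lets the CPU overlap the additions (row_sum is a hypothetical helper; N stands in for the row length):

```c
#define N 10000  /* row length; assumed divisible by 4 */

/* hypothetical helper: sums one row using four independent accumulators */
double row_sum(const double *row)
{
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int j = 0; j < N; j += 4) {
        s0 += row[j + 0];  /* the four chains can run   */
        s1 += row[j + 1];  /* in parallel in the CPU    */
        s2 += row[j + 2];  /* pipeline                  */
        s3 += row[j + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the floating-point additions, so the result may differ from the serial sum in the last bits.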

These are the basic ideas; it is difficult to say more without knowing your platform and compiler, or looking at the generated assembly code.

You might want to take a look at this presentation for more complete examples of optimization opportunities.

If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation across multiple threads.


Another step would be to try OpenMP, something along the following lines (untested):

#pragma omp parallel for private(perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    /* INSERT INNER LOOP HERE */
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
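Filled in with the inner loop and wrapped in a compilable function, that sketch could look like the following (N stands in for 10000 and cluster_size for pCluster->size; compile with -fopenmp, and without it the pragma is simply ignored and the loop runs serially):

```c
#define N 8  /* 10000 in the real code; small here so it runs quickly */

double arr[N][N];

double findIntraClustSimFullCoverage(int cluster_size)
{
    double sum = 0;

    /* each thread gets its own i and perElemSimilarity; the per-thread
       partial sums are combined by the reduction clause */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        double perElemSimilarity = 0;
        for (int j = 0; j < N; j++)
            perElemSimilarity += arr[i][j];
        sum += perElemSimilarity / cluster_size;
    }
    return sum / cluster_size;
}
```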

But note that even if you brought this portion of code down to 0% of your execution time (which is impossible), your GA algorithm would still take hours to run. Your performance bottleneck is elsewhere, now that this portion of code takes 'only' 22% of your running time.

Upvotes: 1

rvalue

Reputation: 2762

  1. The register keyword is only a hint to the optimizer; if the optimizer doesn't think a register is well spent there, it won't use one.
  2. Is the matrix well packed, i.e. is it a contiguous block of memory?
  3. Is 'j' the minor index (i.e. are you going from one element to the next in memory), or are you jumping from one element to that element plus 10000?
  4. Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop only depends on the row it traverses, so calculating row sums lazily and storing them for future reference would make a big difference.

Upvotes: 0

Luchian Grigore

Reputation: 258558

I might be wrong here, but isn't the following equivalent:

for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / ( pCluster->size * pCluster->size ) );
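It is equivalent: dividing each row sum by size and then the total by size is the same as dividing the grand total by size * size, up to floating-point rounding. A small check of the equivalence (the function names and the 3x3 size are illustrative only):

```c
#define N 3

/* original scheme: divide each row sum by size, then the total by size */
double avg_two_divisions(double a[N][N], double size)
{
    double sum = 0;
    for (int i = 0; i < N; i++) {
        double row = 0;
        for (int j = 0; j < N; j++)
            row += a[i][j];
        sum += row / size;
    }
    return sum / size;
}

/* simplified scheme: a single division by size * size at the end */
double avg_one_division(double a[N][N], double size)
{
    double sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum / (size * size);
}
```

The simplified form also saves 10,000 divisions per call, though next to 100 million additions that is a minor gain.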

Upvotes: 0
