Reputation: 25
I have the following piece of C code:
double findIntraClustSimFullCoverage(cluster * pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;

    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;
        for (j = 0; j < 10000; j++)
        {
            perElemSimilarity += arr[i][j];
        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}
NOTE: arr is a matrix of size 10000 x 10000.
This is a portion of a GA (genetic algorithm) code, so this nested for loop runs many times and dominates the run time, i.e. it takes a very long time to produce results. I profiled the code using valgrind / kcachegrind, which indicated that 70% of the process execution time was spent in this nested for loop. The register variables i and j do not appear to actually be kept in registers (profiling with and without the "register" keyword showed no difference).
I simply cannot find a way to optimize this nested for loop (it is very simple and straightforward). Please help me optimize this portion of the code.
Upvotes: 0
Views: 3660
Reputation: 52538
The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values, that's 800 MB. Whatever you do is limited by the time it takes to read 800 MB of data.
On the other hand, are you also writing 10,000 x 10,000 values each time this is called? If not, you could for example store the sum for each row and keep a boolean flag per row indicating that its sum needs to be recalculated, set whenever an element of that row changes. Or you could even update a row's stored sum each time an array element is changed.
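A minimal sketch of that row-cache idea (the names row_sum, row_dirty, set_element and get_row_sum are my own, and N is shrunk from 10000 for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define N 4  /* 10000 in the real code; small here for illustration */

/* Hypothetical cache: one stored sum and one dirty flag per row. */
static double arr[N][N];
static double row_sum[N];
static bool row_dirty[N];

/* Route all writes through this so the flag stays accurate. */
void set_element(int i, int j, double v)
{
    arr[i][j] = v;
    row_dirty[i] = true;
}

/* Recompute a row's sum only if it was marked dirty. */
double get_row_sum(int i)
{
    if (row_dirty[i]) {
        double s = 0;
        for (int j = 0; j < N; j++)
            s += arr[i][j];
        row_sum[i] = s;
        row_dirty[i] = false;
    }
    return row_sum[i];
}

double total_sum(void)
{
    double s = 0;
    for (int i = 0; i < N; i++)
        s += get_row_sum(i);
    return s;
}
```

With this scheme, a call after k modified rows costs O(k * N) instead of O(N * N).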
Upvotes: 0
Reputation: 8095
I'm assuming that you change the arr matrix frequently, else you could just compute the sum (see Lucian's answer) once and remember it.
You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has (likely) been changed, you can store a 'sum' value somewhere, and have every piece of code that updates the matrix update the stored sum appropriately. For instance, assuming you start with an array of all zeros:
double arr[10000][10000];
/* ... initialize it to all zeros ... */
double sum = 0;

// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];

// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];
You might want to completely re-calculate the sum from time to time to avoid accumulation of errors.
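Wrapped in a helper (set_and_track and recompute_sum are names I'm making up, with N shrunk for illustration), the update pattern above looks like this:

```c
#include <assert.h>

#define N 100  /* stand-in for 10000 */

static double arr[N][N];
static double sum = 0;

/* Update one element and keep the running sum consistent. */
void set_and_track(int i, int j, double v)
{
    sum -= arr[i][j];
    arr[i][j] = v;
    sum += arr[i][j];
}

/* Full recomputation, done occasionally to flush rounding error. */
double recompute_sum(void)
{
    double s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += arr[i][j];
    sum = s;
    return s;
}
```

Funneling every write through one function is also what makes the periodic re-calculation safe: no code path can update the matrix without updating the sum.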
Upvotes: 1
Reputation: 5238
If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.
It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way; that is to say, &arr[i][j+1] - &arr[i][j] == sizeof(double).
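For a statically declared double arr[10000][10000] that contiguity holds automatically; it typically breaks if the matrix is built as an array of separately malloc'd row pointers. A quick sanity check (a sketch, using a small static array):

```c
#include <assert.h>
#include <stddef.h>

#define N 16  /* stand-in for 10000 */

double arr[N][N];  /* one contiguous block: row-major layout */

int is_linear_layout(void)
{
    /* Adjacent elements of a row must sit sizeof(double) apart... */
    return (char *)&arr[0][1] - (char *)&arr[0][0] == sizeof(double)
        /* ...and the next row must start right after this one ends. */
        && (char *)&arr[1][0] - (char *)&arr[0][0] == N * sizeof(double);
}
```

If the second condition fails, each row is a separate allocation and the hardware prefetcher cannot stream through the whole matrix.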
You may also try to unroll your inner loop, in case your compiler does not already do it. Your code:

for (j = 0; j < 10000; j++)
{
    perElemSimilarity += arr[i][j];
}
would, for example, become:

for (j = 0; j < 10000; j += 10)
{
    perElemSimilarity += arr[i][j+0];
    perElemSimilarity += arr[i][j+1];
    perElemSimilarity += arr[i][j+2];
    perElemSimilarity += arr[i][j+3];
    perElemSimilarity += arr[i][j+4];
    perElemSimilarity += arr[i][j+5];
    perElemSimilarity += arr[i][j+6];
    perElemSimilarity += arr[i][j+7];
    perElemSimilarity += arr[i][j+8];
    perElemSimilarity += arr[i][j+9];
}
These are the basic ideas; it is difficult to say more without knowing your platform and compiler, and without looking at the generated assembly code.
You might want to take a look at this presentation for more complete examples of optimization opportunities.
If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation across multiple threads.
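As a rough illustration of the SIMD route (SSE2, so x86-only; a real version would handle alignment and row lengths not divisible by 2, and row_sum_sse2 is a name I'm inventing):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum one row of n doubles, two lanes at a time; n assumed even. */
double row_sum_sse2(const double *row, int n)
{
    __m128d acc = _mm_setzero_pd();
    for (int j = 0; j < n; j += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&row[j]));

    /* Horizontal add of the two accumulator lanes. */
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}
```

Note that both SIMD and unrolling change the order of the floating-point additions, so the result can differ slightly from the sequential sum.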
Another step would be to try OpenMP, something along the following lines (untested):

#pragma omp parallel for private(perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    /* INSERT INNER LOOP HERE */
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
But note that even if you brought this portion of code down to 0% of your execution time (which is impossible), your GA algorithm would still take hours to run. Your performance bottleneck is elsewhere now that this portion of code takes "only" 22% of your running time.
Upvotes: 1
Reputation: 2762
The register keyword is only an optimizer hint; if the optimizer doesn't think a register is well spent there, it won't use one.

Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop depends only on the row/column that j traverses, so calculating it lazily and storing it for future reference will make a big difference.
Upvotes: 0
Reputation: 258558
I might be wrong here, but isn't the following equivalent:
for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / (pCluster->size * pCluster->size));
Upvotes: 0