Reputation: 237
for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 5000; j++)
    {
        for (int ii = 0; ii < 20; ii++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                // Extract range from this.
                int low = num & 0xff;
                int high = num >> 8;
                if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
                {
                    // match found
                }
            }
    }
The machine is x86_64, with a 32 KB L1 cache and a 256 KB L2 cache.
Any pointers on how I can possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
Upvotes: 1
Views: 347
Reputation: 57678
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see, there are locations that are accessed many times. Also, multiplying the iteration counts indicates that the inner content is accessed 20 * 20 * 5000 * 5000, or 10,000,000,000 (1E+10) times. That's a lot!
So rather than trying to speed up the execution of 1E10 iterations (such as with execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one-pass scan, and support other decisions such as finding a range or determining the MxN submatrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller data that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of accesses to the matrices.
Upvotes: 0
Reputation: 8715
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in registers. By changing the array access into a simple pointer increment, you make it easier for the compiler to produce efficient code.
Edit: I just noticed @unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
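To make the short-circuit effect concrete, here is a small sketch (the counters and helper names are illustrative) showing that with &&, the second test only runs when the first one passes, so the test that rejects most values should come first:

```cpp
// Instrumented version of the question's test: the counters record how
// often each half of the && actually executes.
static int lowChecks = 0, highChecks = 0;

static bool lowTest(int low, int s)   { ++lowChecks;  return low < s; }
static bool highTest(int s, int high) { ++highChecks; return s > high; }

bool match(int num, int smaller) {
    int low  = num & 0xff;
    int high = num >> 8;
    // With the low test first, highTest runs only when lowTest passes.
    return lowTest(low, smaller) && highTest(smaller, high);
}
```

If profiling shows the high comparison rejects more values, simply swapping the two operands of && saves the more expensive (or more frequently evaluated) test.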
Upvotes: 0
Reputation: 279225
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
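A sketch of the 16-bit idea (pack/unpack are illustrative helpers, not from the original post): the bit manipulation is unchanged, but a uint16_t array occupies half the memory of an int array, so twice as much of it fits in L1/L2 at once:

```cpp
#include <cstdint>

// Pack a [low, high] byte pair into one 16-bit value, and unpack it
// with exactly the same masks and shifts the question already uses.
uint16_t pack(uint8_t low, uint8_t high) {
    return (uint16_t)((high << 8) | low);
}

void unpack(uint16_t num, int& low, int& high) {
    low  = num & 0xff;
    high = num >> 8;
}
```

Whether the halved footprint beats the cost of unaligned-width loads is exactly the kind of thing to measure rather than guess.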
There are some higher-level things you could consider with some early analysis of smaller: for example if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
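A sketch of that early analysis (the helper name is illustrative): since num & 0xff is never negative, any matSmaller entry that is zero or negative lets you skip its entire 5000x5000 scan outright:

```cpp
#include <vector>

// Count the (ii, jj) scans that can be skipped entirely because
// matSmaller[ii][jj] <= 0 makes `low < smaller` impossible
// (low = num & 0xff is always non-negative).
int skippableScans(const std::vector<std::vector<int>>& matSmaller) {
    int skippable = 0;
    for (const auto& row : matSmaller)
        for (int s : row)
            if (s <= 0)
                ++skippable;
    return skippable;
}
```

Each skipped entry saves 25 million iterations of the i/j loops, so even a handful of such values pays for the cheap pre-pass.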
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Upvotes: 4
Reputation: 5949
What are matSmaller and matBigger?
Try changing the access to matBigger[(i+ii) * COL_COUNT + (j+jj)]
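If that means flattening the matrix into one contiguous buffer, the access might be sketched like this (the function name and COL_COUNT parameter are illustrative): the row index (i+ii) is scaled by the row length and the column index (j+jj) added on:

```cpp
#include <vector>

// Flat 1-D layout: element (row, col) of the big matrix lives at
// row * COL_COUNT + col in one contiguous buffer.
int at(const std::vector<int>& flatBigger, int COL_COUNT,
       int i, int ii, int j, int jj)
{
    return flatBigger[(i + ii) * COL_COUNT + (j + jj)];
}
```

Note the parentheses: without them, operator precedence would compute i + (ii * COL_COUNT) + j + jj, which indexes the wrong element.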
Upvotes: 0
Reputation: 14149
Best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
… // match found
… use operator()(int,int,int) for accessing elements.
Upvotes: 1
Reputation: 399743
Some basic advice:
const
in the innermost scope, to hint more to the compiler.high
if the low
test is failing.matBigger
and matSmaller
explicitly, to the innermost stepping into a simple increment.Upvotes: 2