Reputation: 237
for (int i = 0; i < 5000; i++)
    for (int j = 0; j < 5000; j++)
    {
        for (int ii = 0; ii < 20; ii++)
            for (int jj = 0; jj < 20; jj++)
            {
                int num = matBigger[i+ii][j+jj];
                // Extract range from this.
                int low = num & 0xff;
                int high = num >> 8;
                if (low < matSmaller[ii][jj] && matSmaller[ii][jj] > high)
                {
                    // match found
                }
            }
    }
The machine is x86_64, with a 32 KB L1 cache and a 256 KB L2 cache.
Any pointers on how I can possibly optimize this code?
EDIT Some background to the original problem : Fastest way to Find a m x n submatrix in M X N matrix
Upvotes: 1
Views: 347
Reputation: 57678
Looks like there is a lot of repetition here. One optimization is to reduce the amount of duplicate effort. Using pen and paper, I'm showing the matBigger "i" index iterating as:
[0 + 0], [0 + 1], [0 + 2], ..., [0 + 19],
[1 + 0], [1 + 1], ..., [1 + 18], [1 + 19]
[2 + 0], ..., [2 + 17], [2 + 18], [2 + 19]
As you can see, there are locations that are accessed many times. Also, multiplying the iteration counts indicates that the inner content is accessed 20 * 20 * 5000 * 5000, or 10,000,000,000 (1E+10) times. That's a lot!
So rather than trying to speed up the execution of 1E10 iterations (such as with execution (pipeline) cache or data cache optimization), try reducing the number of iterations.
The code is searching the matrix for a number that is within a range: larger than a minimal value and less than the maximum range value.
Based on this, try a different approach:
The objective is to reduce the number of duplicate accesses. Anchor points allow for a one-pass scan, and support other decisions such as finding a range or determining the MxN submatrix that contains the anchor value.
Another idea is to create new data structures containing the matBigger and matSmaller data that are more optimized for searching.
For example, create a {value, coordinate list} entry for each unique value in matSmaller:
Value coordinate list
26 -> (2,3), (6,5), ..., (1007, 75)
31 -> (4,7), (2634, 5), ...
Now you can use this data structure to find values in matSmaller and immediately know their locations. So you could search matBigger for each unique value in this data structure. This again reduces the number of accesses to the matrices.
Upvotes: 0
Reputation: 8715
I agree with Steve about rearranging your loops to have the higher count as the inner loop. Since your code is only doing loads and compares, I believe a significant portion of the time is used for pointer arithmetic. Try an experiment to change Steve's answer into this:
for (int ii = 0; ii < 20; ii++)
{
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
        {
            int *pI = &matBigger[i+ii][jj];
            for (int j = 0; j < 5000; j++)
            {
                int num = *pI++;
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            } // for j
        } // for i
    } // for jj
} // for ii
Even in 64-bit mode, the C compiler doesn't necessarily do a great job of keeping everything in registers. By changing the array access into a simple pointer increment, you make it easier for the compiler to produce efficient code.
Edit: I just noticed @unwind suggested basically the same thing. Another issue to consider is the statistics of your comparison. Is the low or high comparison more probable? Arrange the conditional statement so that the less probable test is first.
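To make the short-circuit effect concrete, here is a small sketch (the counters and helper names are illustrative) showing that with &&, the second test only runs when the first one passes, so the test that rejects most values should come first:

```cpp
// Instrumented version of the question's test: the counters record how
// often each half of the && actually executes.
static int lowChecks = 0, highChecks = 0;

static bool lowTest(int low, int s)   { ++lowChecks;  return low < s; }
static bool highTest(int s, int high) { ++highChecks; return s > high; }

bool match(int num, int smaller) {
    int low  = num & 0xff;
    int high = num >> 8;
    // With the low test first, highTest runs only when lowTest passes.
    return lowTest(low, smaller) && highTest(smaller, high);
}
```

If profiling shows the high comparison rejects more values, simply swapping the two operands of && saves the more expensive (or more frequently evaluated) test.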
Upvotes: 0
Reputation: 279225
First thing I'd try is to move the ii and jj loops outside the i and j loops. That way you're using the same elements of matSmaller for 25 million iterations of the i and j loops, meaning that you (or the compiler if you're lucky) can hoist the access to them outside those loops:
for (int ii = 0; ii < 20; ii++)
    for (int jj = 0; jj < 20; jj++)
    {
        int smaller = matSmaller[ii][jj];
        for (int i = 0; i < 5000; i++)
            for (int j = 0; j < 5000; j++) {
                int num = matBigger[i+ii][j+jj];
                int low = num & 0xff;
                if (low < smaller && smaller > (num >> 8)) {
                    // match found
                }
            }
    }
This might be faster (thanks to less access to the matSmaller array), or it might be slower (because I've changed the pattern of access to the matBigger array, and it's possible that I've made it less cache-friendly). A similar alternative would be to move the ii loop outside i and j and hoist matSmaller[ii], but leave the jj loop inside. The rule of thumb is that it's more cache-friendly to increment the last index of a multi-dimensional array in your inner loops than earlier indexes. So we're "happier" to modify jj and j than we are to modify ii and i.
Second thing I'd try - what's the type of matBigger? Looks like the values in it are only 16 bits, so try it both as int and as (u)int16_t. The former might be faster because aligned int access is fast. The latter might be faster because more of the array fits in cache at any one time.
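A sketch of the 16-bit idea (pack/unpack are illustrative helpers, not from the original post): the bit manipulation is unchanged, but a uint16_t array occupies half the memory of an int array, so twice as much of it fits in L1/L2 at once:

```cpp
#include <cstdint>

// Pack a [low, high] byte pair into one 16-bit value, and unpack it
// with exactly the same masks and shifts the question already uses.
uint16_t pack(uint8_t low, uint8_t high) {
    return (uint16_t)((high << 8) | low);
}

void unpack(uint16_t num, int& low, int& high) {
    low  = num & 0xff;
    high = num >> 8;
}
```

Whether the halved footprint beats the cost of unaligned-width loads is exactly the kind of thing to measure rather than guess.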
There are some higher-level things you could consider with some early analysis of smaller: for example if it's 0 then you needn't examine matBigger for that value of ii and jj, because (num & 0xff) < 0 is always false.
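A sketch of that early analysis (the helper name is illustrative): since num & 0xff is never negative, any matSmaller entry that is zero or negative lets you skip its entire 5000x5000 scan outright:

```cpp
#include <vector>

// Count the (ii, jj) scans that can be skipped entirely because
// matSmaller[ii][jj] <= 0 makes `low < smaller` impossible
// (low = num & 0xff is always non-negative).
int skippableScans(const std::vector<std::vector<int>>& matSmaller) {
    int skippable = 0;
    for (const auto& row : matSmaller)
        for (int s : row)
            if (s <= 0)
                ++skippable;
    return skippable;
}
```

Each skipped entry saves 25 million iterations of the i/j loops, so even a handful of such values pays for the cheap pre-pass.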
To do better than "guess things and see whether they're faster or not" you need to know for starters which line is hottest, which means you need a profiler.
Upvotes: 4
Reputation: 5949
What are matSmaller and matBigger?
Try changing the access to matBigger[(i+ii) * COL_COUNT + (j+jj)]
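If that means flattening the matrix into one contiguous buffer, the access might be sketched like this (the function name and COL_COUNT parameter are illustrative): the row index (i+ii) is scaled by the row length and the column index (j+jj) added on:

```cpp
#include <vector>

// Flat 1-D layout: element (row, col) of the big matrix lives at
// row * COL_COUNT + col in one contiguous buffer.
int at(const std::vector<int>& flatBigger, int COL_COUNT,
       int i, int ii, int j, int jj)
{
    return flatBigger[(i + ii) * COL_COUNT + (j + jj)];
}
```

Note the parentheses: without them, operator precedence would compute i + (ii * COL_COUNT) + j + jj, which indexes the wrong element.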
Upvotes: 0
Reputation: 14149
Best thing is to understand what the code is supposed to do, then check whether another algorithm exists for this problem.
Apart from that:
… // match found
… use operator()(int,int,int) for accessing elements.
Upvotes: 1
Reputation: 399743
Some basic advice:
const
in the innermost scope, to hint more to the compiler.high
if the low
test is failing.matBigger
and matSmaller
explicitly, to the innermost stepping into a simple increment.Upvotes: 2