Inverse

Reputation: 4476

Optimize performance of loop

I've been profiling a bottleneck in my code (the function shown below) that gets called several million times, and I could use tips on improving its performance. The timing numbers (in seconds) in the listing were taken from Sleepy.

Compiled with Visual Studio 2013, /O2, and other typical release settings.

indices typically holds 0 to 20 values, and the other parameters are the same size (b.size() == indices.size() == temps.size() == temps[k].size()).

1:          double Object::gradient(const size_t j, 
2:                                  const std::vector<double>& b, 
3:                                  const std::vector<size_t>& indices, 
4:                                  const std::vector<std::vector<double>>& temps) const
5:  23.27s  {
6:              double sum = 0;
7:  192.16s     for (size_t k : indices)
8:  32.05s          if (k != j)
9:  219.53s             sum += temps[k][j]*b[k];
10:      
11: 320.21s     return boost::math::isfinite(sum) ? sum : 0;
12:
13: 22.86s  }

Any ideas?

Thanks for the tips, guys. Here are the results I got from the suggestions:

(screenshot of benchmark results omitted)

I found it interesting that switching to cbegin() and cend() had such a large impact. I guess the compiler isn't being as smart as it could be there. I'm happy with the bump, but I'm still curious whether there's more room here through unrolling or vectorization.

For those interested, here is my benchmark for isfinite(x):

boost::isfinite(x):
------------------------
SPEED: 761.164 per ms
TIME:  0.001314 ms
   +/- 0.000023 ms

std::isfinite(x):
------------------------
SPEED: 266.835 per ms
TIME:  0.003748 ms
   +/- 0.000065 ms

Upvotes: 12

Views: 8556

Answers (5)

kfsone

Reputation: 24249

If you know that the k == j case will always be hit (i.e. j always appears in indices), you can eliminate the conditional by cancelling that term up front, and replace the conditional return with a simple conditional store.

double sum = -(temps[j][j]*b[j]);   // cancel out the k == j term in advance
for (size_t k : indices)
    sum += temps[k][j]*b[k];
if (!std::isfinite(sum))            // requires <cmath>
    sum = 0.0;
return sum;

Range-based for is still new enough to not always get great optimization. You may also want to try:

const auto end = indices.cend();
for (auto it = indices.cbegin(); it != end; ++it) {
    sum += temps[*it][j]*b[*it];
}

and see if the perf varies.

Upvotes: 6

valdo

Reputation: 12943

I believe you'll get a significant performance boost if you rearrange your temps. The following code line

temps[k][j]*b[k]

iterated over the values of k is (as I understand it) a multiplication of a (transposed) matrix by a vector. Now, accessing an element temps[k][j] actually involves reading temps[k], dereferencing it (a pointer to a separately allocated vector), and then reading its [j] element.

Instead of using a vector<vector<double>> you could use a 1-dimensional vector. Moreover, since this function is your bottleneck, you could store the data in it in a "transposed" way. That means accessing temps[k][j] in the new format turns into temps[j * N + k], where N is the range of j. By this you also utilize the cache more efficiently.
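
A minimal sketch of what such a flat, transposed layout could look like (the FlatMatrix name and interface here are just an illustration, not from the question's code):

#include <cstddef>
#include <vector>

// One contiguous buffer, stored "transposed" so that for a fixed j
// the k values are adjacent in memory.
struct FlatMatrix {
    std::size_t K;               // range of k (== temps.size() in the question)
    std::vector<double> data;    // element (k, j) lives at data[j * K + k]

    FlatMatrix(std::size_t k_range, std::size_t j_range)
        : K(k_range), data(k_range * j_range, 0.0) {}

    double  operator()(std::size_t k, std::size_t j) const { return data[j * K + k]; }
    double& operator()(std::size_t k, std::size_t j)       { return data[j * K + k]; }
};

With this layout, the inner loop for a fixed j walks one contiguous block of K doubles instead of chasing a pointer per row.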

Upvotes: 1

user4842163

Reputation:

The first relatively simple optimization I'd attempt, is this:

const std::vector<std::vector<double>>& temps

This is potentially all over the place in memory and not cache-friendly whatsoever, as it's basically a dynamic array of dynamic arrays, with a separate dynamic array for every single row (or column) of the matrix, each potentially stored in a completely different place in memory.

If you can create a "Matrix" class which stores the data in a single array (ex: single vector), that could help quite a bit.

The next thing I'd try is this:

7:  192.16s     for (size_t k : indices)
8:  32.05s          if (k != j)
9:  219.53s             sum += temps[k][j]*b[k];

We can see that 'j' never changes in the loop, so if 'temps' could be transposed in advance (also easy to do if you write a proper Matrix class), you could access the memory sequentially from left to right and touch multiple components of the matrix per cache line.
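
As an illustration, if temps were copied in advance into a transposed flat buffer temps_t of K * N doubles (the names temps_t and K are assumptions for this sketch, not from the question), the hot loop could look something like:

// Column j of the original matrix is now one contiguous run of K doubles.
const double* col = temps_t.data() + j * K;   // temps_t is a std::vector<double>
double sum = 0;
for (size_t k : indices)
    if (k != j)
        sum += col[k] * b[k];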

Last but not least, do you really run into infinity or NaN with the inputs you are working with? If you do, I'd see if I could move the NaN check elsewhere, e.g. into validating those temps arrays (provided you reuse the same ones multiple times). Unless you've already run into failures here or are working with really mission-critical software, I'd try turning that check into a debug-mode assertion and only handle such an edge case if you actually encounter it in production.
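
As a sketch of that last idea (an illustration of the approach, not code from the question), the runtime check could become a debug-only assertion:

// Requires <cassert> and <cmath>; with NDEBUG defined (typical release
// settings) the assert compiles away, so the hot path no longer pays for it.
double sum = 0;
for (size_t k : indices)
    if (k != j)
        sum += temps[k][j] * b[k];
assert(std::isfinite(sum));
return sum;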

Upvotes: 0

nv3

Reputation: 396

There are two points that stick out:

(a) boost::math::isfinite is taking relatively long. If possible, try to establish by other means that your results are within the valid range.

(b) Storing a 2D array as a nested vector is not the fastest way. It would probably be faster to make temps a 1D array and pass the row size as a parameter. To access an element you will have to do the index calculation yourself, but since one of the indices (j) is constant over the loop, you can compute its contribution to the offset before the loop and do only a cheap index calculation inside it. That might get you some significant improvement.
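
For illustration, a sketch of what that could look like (the free-function signature, row-major layout, and names here are assumptions, not the question's actual interface):

#include <cstddef>
#include <vector>
#include <boost/math/special_functions/fpclassify.hpp>  // boost::math::isfinite

// temps flattened row-major: the old temps[k][j] becomes flat[k * row_size + j].
double gradient_flat(const std::size_t j,
                     const std::vector<double>& b,
                     const std::vector<std::size_t>& indices,
                     const std::vector<double>& flat,
                     const std::size_t row_size)
{
    double sum = 0;
    const double* base = flat.data() + j;        // j is fixed: fold its offset in up front
    for (std::size_t k : indices)
        if (k != j)
            sum += base[k * row_size] * b[k];
    return boost::math::isfinite(sum) ? sum : 0;
}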

Upvotes: 4

PavPS

Reputation: 76

I'm not sure it's possible in your solution, but please consider storing the input data so that it fits into the CPU cache. Besides other costs like implicit begin()/end() calls, which you might optimize away in the end, cache misses will always hurt you. Do not use pointer-to-pointer structures in your perf-critical code. Please also try

for (const size_t& k : indices)

to avoid copying. However, the compiler may well do that optimization for you already.

Upvotes: 0
