Reputation: 4476
I've been profiling a bottleneck in my code (the function shown below) that gets called several million times, and I could use tips on improving its performance. The per-line timings (the "23.27s"-style annotations) were taken from Sleepy.
Compiled with Visual Studio 2013, /O2 and otherwise typical release settings.
indices typically holds 0 to 20 values, and the other parameters are the same size (b.size() == indices.size() == temps.size() == temps[k].size()).
1: double Object::gradient(const size_t j,
2: const std::vector<double>& b,
3: const std::vector<size_t>& indices,
4: const std::vector<std::vector<double>>& temps) const
5: 23.27s {
6: double sum = 0;
7: 192.16s for (size_t k : indices)
8: 32.05s if (k != j)
9: 219.53s sum += temps[k][j]*b[k];
10:
11: 320.21s return boost::math::isfinite(sum) ? sum : 0;
13: 22.86s }
Any ideas?
Thanks for the tips, guys. Here are the results I got from the suggestions:
I found it interesting that switching to cbegin() and cend() had such a large impact. I guess the compiler isn't being as smart as it could be there. I'm happy with the bump, but I'm still curious whether there's more room here through unrolling or vectorization.
For those interested, here is my benchmark for isfinite(x):
boost::isfinite(x):
------------------------
SPEED: 761.164 per ms
TIME: 0.001314 ms
+/- 0.000023 ms
std::isfinite(x):
------------------------
SPEED: 266.835 per ms
TIME: 0.003748 ms
+/- 0.000065 ms
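For anyone who wants to reproduce a comparison like this without pulling in Boost, a minimal harness along these lines should work. The names (isfinite_bits, checks_per_ms) and the bit-test alternative are illustrative, not what I actually ran, and absolute numbers will of course differ from the ones above:

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Branch-free finiteness test: a double is finite iff its exponent
// bits (62..52) are not all ones.
bool isfinite_bits(double x) {
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (bits & 0x7ff0000000000000ULL) != 0x7ff0000000000000ULL;
}

// Times one checker over the same data; returns checks per millisecond.
template <typename F>
double checks_per_ms(F check, const std::vector<double>& data) {
    std::size_t finite = 0;
    const auto t0 = std::chrono::steady_clock::now();
    for (double v : data)
        finite += check(v) ? 1 : 0;
    const auto t1 = std::chrono::steady_clock::now();
    // Volatile sink so the loop's result is observably used.
    static volatile std::size_t sink = 0;
    sink += finite;
    const double ms =
        std::chrono::duration<double, std::milli>(t1 - t0).count();
    return ms > 0.0 ? static_cast<double>(data.size()) / ms : 0.0;
}
```

Call it as, e.g., `checks_per_ms([](double v) { return std::isfinite(v); }, data)` versus `checks_per_ms(isfinite_bits, data)` on the same vector.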
Upvotes: 12
Views: 8556
Reputation: 24249
If you know that the conditional will be met (that is, that some iteration will have k == j), you can eliminate the conditional by subtracting that term up front, and replace the ternary in the return with a simple conditional store.
double sum = -(temps[j][j]*b[j]);
for (size_t k : indices)
sum += temps[k][j]*b[k];
if (!std::isfinite(sum))
sum = 0.0;
return sum;
Range-based for is still new enough to not always get great optimization. You may also want to try:
const auto end = indices.cend();
for (auto it = indices.cbegin(); it != end; ++it) {
sum += temps[*it][j]*b[*it];
}
and see if the perf varies.
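Putting the two snippets above together, the whole function might look like this (gradient_no_branch is just an illustrative name; like the first snippet, it assumes j always appears in indices exactly once):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Both suggestions combined: cancel the k == j term up front instead of
// branching, and use an explicit iterator loop with a hoisted end().
double gradient_no_branch(std::size_t j,
                          const std::vector<double>& b,
                          const std::vector<std::size_t>& indices,
                          const std::vector<std::vector<double>>& temps) {
    double sum = -(temps[j][j] * b[j]);      // cancels the k == j term
    const auto end = indices.cend();
    for (auto it = indices.cbegin(); it != end; ++it)
        sum += temps[*it][j] * b[*it];
    if (!std::isfinite(sum))                 // conditional store, no ternary
        sum = 0.0;
    return sum;
}
```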
Upvotes: 6
Reputation: 12943
I believe you'll get a significant performance boost if you rearrange your temps. The following line of code,
temps[k][j]*b[k]
iterated over the values of k, is (as far as I understand it) a multiplication of a (transposed) matrix by a vector. Accessing an element temps[k][j] actually involves reading temps[k], dereferencing it (a pointer to a separately allocated vector), and only then reading its [j] element.
Instead of a vector<vector<double>> you could use a 1-dimensional vector. Moreover, since this function is your bottleneck, you could store the data in it in a "transposed" way: accessing temps[k][j] in the new format then turns into temps[j * N + k], where N is the range of j.
This way you also utilize the cache more efficiently.
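A sketch of what the loop could look like with that layout (the flat vector, the N parameter, and the function name are assumptions for illustration, not the OP's actual code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Flat, transposed layout: element (k, j) is stored at j * N + k, so the
// inner loop walks consecutive doubles instead of chasing row pointers.
double gradient_flat(std::size_t j, std::size_t N,
                     const std::vector<double>& b,
                     const std::vector<std::size_t>& indices,
                     const std::vector<double>& temps_flat) {
    const double* row = &temps_flat[j * N];  // "column" j, now contiguous
    double sum = 0.0;
    for (std::size_t k : indices)
        if (k != j)
            sum += row[k] * b[k];
    return std::isfinite(sum) ? sum : 0.0;
}
```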
Upvotes: 1
Reputation:
The first relatively simple optimization I'd attempt, is this:
const std::vector<std::vector<double>>& temps
This is potentially scattered all over memory and not cache-friendly whatsoever: it's basically a dynamic array of dynamic arrays, with a separate allocation for every single row (or column) of the matrix, each potentially in a completely different place in memory.
If you can create a "Matrix" class which stores the data in a single array (ex: single vector), that could help quite a bit.
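A minimal sketch of such a class, assuming row-major storage (the name and interface are illustrative):

```cpp
#include <cstddef>
#include <vector>

// One contiguous allocation; element (r, c) lives at r * cols + c,
// so a whole row fits in a few cache lines.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}

    double& operator()(std::size_t r, std::size_t c) {
        return data_[r * cols_ + c];
    }
    const double& operator()(std::size_t r, std::size_t c) const {
        return data_[r * cols_ + c];
    }

private:
    std::size_t cols_;
    std::vector<double> data_;
};
```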
The next thing I'd try is this:
7: 192.16s for (size_t k : indices)
8: 32.05s if (k != j)
9: 219.53s sum += temps[k][j]*b[k];
We can see that 'j' never changes in the loop, so if 'temp' could be transposed in advance (also easy to do if you write a proper Matrix class), you could just access the memory sequentially from left to right and access multiple components of the matrix in a single cache line.
Last but not least, do you really run into infinity or NaN with the inputs you are working with? If so, I'd see whether the NaN check could be moved elsewhere, e.g. into validating those temps
arrays once (provided you reuse the same ones multiple times). Unless you've already run into failures here, or are working on really mission-critical software, I'd turn that check into a debug-mode assertion and only handle the edge case if you actually encounter it in production.
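As a sketch of moving the check out of the hot path, one could validate each temps table once when it is built (the helper name is illustrative):

```cpp
#include <cmath>
#include <vector>

// Check the whole table once, up front, instead of testing the sum
// on every one of the several million gradient() calls.
bool all_finite(const std::vector<std::vector<double>>& temps) {
    for (const auto& row : temps)
        for (double v : row)
            if (!std::isfinite(v))
                return false;
    return true;
}
```

Inside gradient() itself, `assert(all_finite(temps));` then compiles away entirely in release builds where NDEBUG is defined.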
Upvotes: 0
Reputation: 396
There are two points that stick out:
(a) boost::math::isfinite is taking a relatively long time. If possible, try to establish by other means that your results are within the valid range.
(b) Storing a 2D array as a nested vector is not the fastest option. It would probably be faster to make temps a 1D array and pass the row size as a parameter. For access to an element you then have to do the index calculation yourself, but since one of the indices (j) is constant over the loop, you can compute the start element's index before the loop and just increment the index by 1 inside it. That might get you a significant improvement.
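A sketch of that incremental-index loop, assuming the flat, transposed layout described in an earlier answer (element (k, j) at j * N + k) and that k runs over the full 0..N-1 range rather than a sparse indices vector; the names are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// The start index is computed once before the loop; inside, the index
// just increments by 1 because consecutive k are adjacent in memory.
double gradient_incremental(std::size_t j, std::size_t N,
                            const std::vector<double>& b,
                            const std::vector<double>& temps_flat) {
    std::size_t idx = j * N;                 // start index, computed once
    double sum = 0.0;
    for (std::size_t k = 0; k < N; ++k, ++idx)
        if (k != j)
            sum += temps_flat[idx] * b[k];
    return std::isfinite(sum) ? sum : 0.0;
}
```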
Upvotes: 4
Reputation: 76
I'm not sure it's possible in your solution, but please consider storing the input data so that it fits in the CPU cache. Besides other costs, like the implicit begin()/end() calls you might optimize away in the end, cache misses will always hurt you. Do not use pointer-to-pointer structures in your perf-critical code. You could also try
for (const size_t& k : indices)
to avoid copying, although the compiler will probably do that optimization for you anyway.
Upvotes: 0