Reputation: 758
Rewriting the following function using std::find_if yields roughly 2x speed up, but only when given Intel CPU.
bool process_update_range_based(int treeId, int newHeight, bool tall,
std::vector<HeightRecord>& vec, std::vector<int>& updateSet) {
for (auto &hr : vec) {
if (hr.treeId == treeId) {
if (hr.height == newHeight)
return false;
else {
hr.height = newHeight;
updateSet.push_back(treeId);
return true;
}
}
}
vec.emplace_back(treeId, newHeight, tall);
return true;
}
The following measurements were taken: simulate 10 times, measure average and minimum time to execute N iterations. Minimal value is the most reliable, especially in browser when there is no data on external CPU clocks (Godbolt values do vary a lot). Compiler flags were -O3 -march=x86-64-v3
CPU | for loop avg | find_if avg | for loop min | find_if min |
---|---|---|---|---|
Godbolt GenuineIntel | N/A | N/A | 0.0275 | 0.0136 |
Godbolt AuthenticAMD | N/A | N/A | 0.0245 | 0.0212 |
Apple M2 | 0.0258 | 0.0235 | 0.0216 | 0.0216 |
AMD 5600H | 0.0288 | 0.0287 | 0.0244 | 0.0217 |
Intel i5-1335U | 0.0294 | 0.0152 | 0.0223 | 0.0125 |
Godbolt source code & benchmark
As you can see, on Intel CPUs std::find_if almost 2x faster compared to basic loop. Meanwhile AMD/Apple CPUs either 10-20% or completely the same. This behaviour is consistent across multiple compilers (g++, clang++, MSVC)
It's especially confusing since acceleration occures only when replacing range-based loop with library function. .asm code looks like some simple moves and comparisons. Godbolt with functions assembly only (no benchmark)
What causes this difference? What's so special about Intel CPUs? Why they execute the same assembly 2x faster than AMD/Apple CPUs?
UPD. The std::array version show similar results overall. However with std::array it seems that results differ more depending on order of execution:
Godbolt std::array source code & benchmark
Godbolt std::array source code
Upvotes: 6
Views: 178