Reputation: 651
While testing a custom heap manager (meant to replace the system one) I encountered some slowdowns compared to the system heap.
I used AMD CodeAnalyst to profile an x64 application on Windows 7, on an Intel Xeon CPU E5-1620 v2 @ 3.70 GHz, and got the following results:
This block consumes about 90% of the time of the whole application run. We can see a lot of time spent on "cmp [rsp+18h], rax" and "test eax, eax", but no time spent on the jumps right below the compares. Is it OK that the jumps take no time? Is it because of the branch prediction mechanism?
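Just for illustration (this is not my actual allocator code, only a hypothetical check of the same shape), a condition like the one below can compile to that kind of compare/test followed by a conditional jump:

    #include <cstddef>

    // Hypothetical check, not the real heap-manager code: a size that got
    // spilled to the stack compared against a register (cmp [rsp+18h], rax)
    // and an int living in eax tested against itself (test eax, eax),
    // each followed by a conditional jump.
    bool block_usable(std::size_t block_size, std::size_t wanted, int is_free)
    {
        if (block_size < wanted)   // roughly: cmp <spilled size>, <reg> ; jump
            return false;
        if (is_free == 0)          // roughly: test eax, eax ; jump
            return false;
        return true;
    }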
I changed the condition to the opposite one, and here is what I got (the absolute numbers are a bit different because I stopped the profiling sessions manually, but a lot of time is still taken by the compares):
There are so many executions of these compares that they become a bottleneck... That is how I interpret these results. So probably the best optimization is to rework the algorithm, right?
Upvotes: 1
Views: 45
Reputation: 365372
Intel and AMD CPUs both macro-fuse cmp/jcc pairs into a single compare-and-branch uop (Intel) or macro-op (AMD), so the compare and the jump execute as one operation and the samples get attributed to the cmp/test rather than to the jump right after it. Intel SnB-family CPUs like yours can also do this with some instructions that write an output register, like and, sub/add, and inc/dec.
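As a sketch (exact code generation depends on the compiler), even a plain countdown loop like the hypothetical one below ends in such a fused pair: the decrement both writes a register and sets the flags, and the jnz that follows fuses with it, so a sampling profiler puts the loop-overhead cycles on the arithmetic instruction and none on the jump:

    #include <cstddef>

    // Hypothetical helper, shown only for the instruction pattern: the "--n"
    // plus the "n != 0" test typically become sub/dec + jnz, which SnB-family
    // CPUs macro-fuse into one uop, just like cmp + jcc.
    void zero_buffer(unsigned char* p, std::size_t n)
    {
        while (n != 0) {   // dec/sub ... ; jnz  -- macro-fused pair
            *p++ = 0;
            --n;
        }
    }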
To really understand profiling data, you have to understand something about how the out-of-order pipeline works in the microarch you're tuning on. See the links at the x86 tag wiki, especially Agner Fog's microarch pdf.
You should also be aware that profiled cycle counts can get charged to the instruction that is waiting for a result, not to the instruction that is slow to produce it.
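For example (a hypothetical first-fit walk, not your actual code): in a pointer-chasing loop like this, the load of b->size is what actually misses cache and stalls, but the samples usually land on the cmp that consumes the loaded value, so the compare looks expensive in the profile:

    #include <cstddef>

    // Hypothetical free-list node and search, not the real heap manager.
    struct Block {
        std::size_t size;
        Block*      next;
    };

    // Each b->next and b->size is a dependent load that can miss in cache.
    // The cmp generated for "b->size < wanted" only waits for that load,
    // yet profile samples tend to be charged to the cmp (the consumer)
    // rather than to the load that is actually slow.
    Block* first_fit(Block* b, std::size_t wanted)
    {
        while (b != nullptr && b->size < wanted)
            b = b->next;
        return b;
    }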
Upvotes: 1