Frank
Frank

Reputation: 4461

Interpreting Intel VTune's Memory Bound Metric

I see the following when I run Intel VTune on my workload:

 Memory Bound                  50.8%             

I read the Intel doc, which says (Intel doc):

Memory Bound measures a fraction of slots where pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation in addition to less common cases where stores could imply back-pressure on the pipeline.

Does that mean that roughly half of the instructions in my app are stalled waiting for memory, or is it more subtle than that?

Upvotes: 2

Views: 2549

Answers (2)

Hadi Brais
Hadi Brais

Reputation: 23679

A slot is an execution port of the pipeline. In general in the VTune documentation, a stall could either mean "not retired" or "not dispatched for execution". In this case, it refers to the number of cycles in which zero uops were dispatched.

According to the VTune include configuration files, Memory Bound is calculated as follows:

Memory_Bound = Memory_Bound_Fraction * BackendBound

Memory_Bound_Fraction is basically the fraction of slots mentioned in the documentation. However, according to the top-down method discussed in the optimization manual, the memory bound metric is relative to the backend bound metric. So this is why it is multiplied by BackendBound.

I'll focus on the first term of the formula, Memory_Bound_Fraction. The formula for the second term, BackendBound, is actually complicated.

Memory_Bound_Fraction is calculated as follows:

Memory_Bound_Fraction = (CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB) * NUM_OF_PORTS / Backend_Bound_Cycles * NUM_OF_PORTS

NUM_OF_PORTS is the number of execution ports of the microarchitecture of the target CPU. This can be simplified to:

Memory_Bound_Fraction = CYCLE_ACTIVITY.STALLS_MEM_ANY + RESOURCE_STALLS.SB / Backend_Bound_Cycles

CYCLE_ACTIVITY.STALLS_MEM_ANY and RESOURCE_STALLS.SB are performance events. Backend_Bound_Cycles is calculated as follows:

Backend_Bound_Cycles = CYCLE_ACTIVITY.STALLS_TOTAL + UOPS_EXECUTED.CYCLES_GE_1_UOP_EXEC - Few_Uops_Executed_Threshold - Frontend_RS_Empty_Cycles + RESOURCE_STALLS.SB

Few_Uops_Executed_Threshold is either UOPS_EXECUTED.CYCLES_GE_2_UOP_EXEC or UOPS_EXECUTED.CYCLES_GE_3_UOP_EXEC depending on some other metric. Frontend_RS_Empty_Cycles is either RS_EVENTS.EMPTY_CYCLES or zero depending on some metric.

I realize this answer still needs a lot of additional explanation and BackendBound needs to be expanded. But this early edit makes the answer accurate.

Upvotes: 2

rdb77
rdb77

Reputation: 79

The pipeline slots concept used by VTune is explain e.g. here: https://software.intel.com/en-us/top-down-microarchitecture-analysis-method-win. In short pipeline slot represents the hardware resources needed to process one uOp. So for 4-wide CPUs (most Intel processors) we can execute 4 Ops each cycle and the total number of slots will be measured as 4 * CPU_CLK_UNHALTED.THREAD by VTune. The Memory Bound metric is built on CYCLE_ACTIVITY.STALLS_MEM_ANY event which gives you directly stalls due to memory. Taking into account out-of-order. Basically only if CPU is stalled and at the same time it has in-flight loads the counter is incremented. If there are loads in-flight but CPU is kept busy it is not accounted as memory stall. So Memory Bound metric provides quite accurate estimation on how much the workload is bound by memory performance issues. The value of 50% means that half of the time was wasted waiting for data from memory.

Upvotes: 2

Related Questions