idle_cycles

Reputation: 173

Are snoop requests sent to all the cores in a multi node setup?

I understand that Intel uses a home-snooping coherency protocol in QPI and perhaps something more complex/dynamic (workload-specific) in UPI. But if a cache line is in the I (INVALID) state to begin with while none of the other cores have it in their L1/L2, once the cache line is requested from the home agent, will the load request also be broadcasted to other local cores? I believe it will. However, will the load request also be broadcasted to cores on a different node?

Another possible explanation is this: if the line is not found in L2, then the L3 memory controller will be asked for it. The LLC controller will know which DIMM/core has the requested physical data (using a directory) and will route the request to the corresponding core via QPI/UPI. Next, the request is broadcasted among the cores in the target node only, by that node's L3 controller. Finally, the L2 controller will be informed about the inter-node communication, so L2 won't broadcast to other local cores. This would imply that requests are never broadcasted beyond a node.

I understand that this kind of information might not be available publicly, but any ideas are appreciated.

Upvotes: 2

Views: 476

Answers (1)

Hadi Brais

Reputation: 23669

But if a cache line is in I (INVALID) state to begin with while none of the other cores have it in their L1/L2, once the cache line is requested from home agent will the load request be also broadcasted to other local cores?

This is an implementation detail and is not part of the QPI specification. On all Intel processors starting with Nehalem, whether the L3 cache is inclusive or non-inclusive, each caching agent on the on-die interconnect has an inclusive directory for tracking the cache lines that it owns (i.e., whose physical address is mapped to it). So a snoop is never broadcasted to all local cores unless the directory indicates that all of them need to be snooped. On a miss in the L3 cache, the request is sent to the home agent of the target cache line.
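To make the directory-filtered snoop concrete, here is a minimal Python sketch. It is a toy model only: the `CachingAgent` class, its per-line core bitmask, and the action names are hypothetical illustrations, not Intel's actual structures or message types.

```python
from dataclasses import dataclass, field


@dataclass
class CachingAgent:
    """Toy model of one caching agent with an inclusive per-line directory."""
    num_cores: int
    # Hypothetical layout: line address -> bitmask of cores that may hold
    # the line in their private L1/L2 caches.
    directory: dict = field(default_factory=dict)

    def record_fill(self, line_addr: int, core: int) -> None:
        """Note that `core` now may hold `line_addr` in its private caches."""
        self.directory[line_addr] = self.directory.get(line_addr, 0) | (1 << core)

    def handle_load(self, line_addr: int, requester: int):
        """Return (action, cores_to_snoop) for a load from `requester`."""
        mask = self.directory.get(line_addr)
        if mask is None:
            # The line is not tracked, so no local core can hold it:
            # send no local snoops and pass the request to the home agent.
            return ("forward_to_home_agent", [])
        # Snoop only the cores whose bit is set, never all cores blindly.
        targets = [c for c in range(self.num_cores)
                   if (mask >> c) & 1 and c != requester]
        return ("directory_hit", targets)
```

For example, after `agent.record_fill(0x40, 2)` on a four-core agent, a load of `0x40` from core 0 snoops only core 2, while a load of an untracked line goes straight to the home agent with no local snoops.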

will the load request be broadcasted to cores on a different node also?

This is also an implementation detail. It depends on the coherence mode. If the processor supports a memory-level coherence directory and that directory is enabled, then there is no need to broadcast for every request. Some processors support opportunistic snoop broadcast (OSB). If OSB is enabled, the home agent may speculatively broadcast a snoop if bandwidth is available. This is done in parallel with the directory lookup operation. If the directory lookup result indicates that there is no need to snoop other NUMA nodes, the home agent sends the requested data back without waiting for the snoop responses, thereby reducing latency.
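The home agent's decision can be sketched roughly as follows. This is a simplified model under stated assumptions: the function name, the event strings, and the boolean "remote nodes may hold a copy" directory are all hypothetical, and the branch taken when the directory says a snoop is needed is my assumption, since the description above only covers the no-snoop case. Real hardware performs the speculative broadcast and the lookup in parallel; the list order here is just a log.

```python
def home_agent_read(line_addr, directory, osb_enabled, bandwidth_available):
    """Return the list of events for one read request at the home agent.

    `directory` is a hypothetical dict: line address -> True if some other
    NUMA node may hold a copy of the line.
    """
    events = []
    # With OSB, a snoop may be broadcast speculatively if bandwidth allows.
    speculative = osb_enabled and bandwidth_available
    if speculative:
        events.append("broadcast_snoop_speculatively")
    events.append("directory_lookup")
    remote_may_hold = directory.get(line_addr, False)
    if not remote_may_hold:
        # No remote copy: reply immediately, ignoring any pending snoops.
        events.append("return_data_without_waiting_for_snoops")
    else:
        # Assumed behavior: a snoop is required, so send one if it was not
        # already broadcast, then wait for the responses before replying.
        if not speculative:
            events.append("send_snoop")
        events.append("wait_for_snoop_responses")
        events.append("return_data")
    return events
```

Calling `home_agent_read(0x40, {}, True, True)` shows the latency benefit: the speculative snoop goes out, but the data is returned as soon as the directory reports that no other node needs to be snooped.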

Upvotes: 2
