MetallicPriest

Reputation: 30745

Do typical multicore processors have multiple ports from L1 to L2?

For typical x86 multicore processors, let us say we have a processor with 2 cores, and both cores encounter an L1 instruction cache miss when reading an instruction. Let's also assume that the two cores are accessing addresses that fall in separate cache lines. Would those two cores get data from L2 to the L1 instruction cache simultaneously, or would it be serialized? In other words, do we have multiple ports for L2 cache access for the different cores?

Upvotes: 5

Views: 974

Answers (1)

osgx

Reputation: 94175

For typical x86 multicore processors, let us say, we have a processor with 2 cores

OK, let's use an early variant of the Intel Core 2 Duo with two cores (Conroe). It has 2 CPU cores, 2 L1i caches, and a shared L2 cache.

and both cores encounter an L1 instruction cache miss when reading an instruction.

OK, there will be a miss in L1i when reading the next instruction (a miss in L1d, when you access data, works in a similar way, but L1i serves only reads while L1d serves reads & writes). Each L1i that misses will generate a request to the next layer of the memory hierarchy, the L2 cache.

Lets also assume that both of the cores are accessing data in addresses which are in separate cache lines.

Now we need to know how the caches are organized (this is the classic medium-detail cache scheme, which is logically similar to real hardware). A cache is a memory array with special access circuitry, and it looks like a 2D array. We have many sets (64 in this picture), and each set has several ways. When we ask the cache for data at some address, the address is split into 3 parts: tag, set index, and offset within the cache line. The set index is used to select the set (the row in our 2D cache memory array); then the tags in all ways are compared with the tag part of the request address (to find the right column in the 2D array), which is done in parallel by 8 tag comparators. If some tag in the cache equals the tag part of the request address, the cache has a "hit" and the cache line from the selected cell is returned to the requester.

Ways and sets: the 2D array of a cache (image from http://www.cnblogs.com/blockcipher/archive/2013/03/27/2985115.html or http://duartes.org/gustavo/blog/post/intel-cpu-caches/)

An example where set index 2 is selected, and the parallel tag comparators give a "hit" (tag equality) for Way 1.
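The address split described above can be sketched in a few lines of code. This is an illustration only; the geometry (64-byte lines, 64 sets) is an assumption matching the figure (the values of a 32 KiB, 8-way L1 such as Conroe's) and real hardware does this in wiring, not arithmetic:

```python
# Sketch of how a cache splits a request address into tag / set index /
# offset, assuming the geometry from the figure: 64-byte lines and 64 sets.
LINE_SIZE = 64      # bytes per cache line -> 6 offset bits
NUM_SETS  = 64      # number of sets      -> 6 set-index bits

def split_address(addr):
    offset  = addr % LINE_SIZE                 # byte within the cache line
    set_idx = (addr // LINE_SIZE) % NUM_SETS   # selects the row (set)
    tag     = addr // (LINE_SIZE * NUM_SETS)   # compared against all 8 ways
    return tag, set_idx, offset

print(split_address(0x2A47))  # -> (2, 41, 7)
```

The set index picks the row; the tag is then compared in parallel against every way of that row, exactly as in the figure.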

What is a "port" to a memory or a cache? It is the hardware interface between external hardware blocks and the memory, with lines for the request address (set by the external block: for L1 it is set by the CPU, for L2 by L1), the access type (load or store; may be fixed for the port), data input (for stores), and data output with a ready bit (set by the memory; cache logic handles misses too, so it returns data both on a hit and on a miss, just later on a miss).

If we want to increase the true port count, we must add hardware: for a raw SRAM memory array we must add two transistors per bit to add one port; for a cache we must duplicate ALL of the tag comparator logic. That cost is too high, so there is not much multiported memory in a CPU, and where there are several ports, the count of true ports is small.

But we can emulate having several ports. From http://web.eecs.umich.edu/~twenisch/470_F07/lectures/15.pdf (EECS 470, 2007, slide 11):

Parallel cache access is harder than parallel FUs

  • fundamental difference: caches have state, FUs don’t
  • one port affects future for other ports

Several approaches used

  • true multi‐porting
  • multiple cache copies
  • virtual multi‐porting
  • multi‐banking (interleaving)
  • line buffers

Multi-banking (sometimes called slicing) is used by modern chips ("Intel Core i7 has four banks in L1 and eight banks in L2"; figure 1.6 on page 9 of ISBN 1598297546 (2011) - https://books.google.com/books?id=Uc9cAQAAQBAJ&pg=PA9&lpg=PA9). It means there are several smaller hardware caches, and some bits of the request address (part of the set index - think of the sets/rows as split over 8 parts, or colored into interleaved rows) are used to select the bank. Each bank has a low port count (1) and functions just like a classic cache (each bank has a full set of tag comparators; but the height of the bank - the number of sets in it - is smaller, and every tag in the array is routed to only a single tag comparator - as cheap as in a single-ported cache).
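Bank selection can be sketched the same way as the address split. The bank count (8, as quoted for the i7 L2) and the interleaving scheme (low set-index bits pick the bank) are assumptions for illustration; real chips may hash other address bits:

```python
# Sketch of multi-banking: bits just above the line offset select one of
# several single-ported banks, so adjacent cache lines land in different
# banks (interleaved rows). 8 banks and 64-byte lines are assumed values.
LINE_SIZE = 64
NUM_BANKS = 8

def bank_of(addr):
    # line number modulo bank count -> interleaved bank selection
    return (addr // LINE_SIZE) % NUM_BANKS

def can_serve_in_parallel(addr_a, addr_b):
    # each single-ported bank serves one request per cycle:
    # distinct banks -> parallel; same bank -> serialized
    return bank_of(addr_a) != bank_of(addr_b)

print(can_serve_in_parallel(0x1000, 0x1040))  # adjacent lines: True
print(can_serve_in_parallel(0x1000, 0x1200))  # 8 lines apart: False (bank conflict)
```

Two requests that hit different banks proceed as if the cache were multiported; two requests mapping to the same bank collide on its single port.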

Would those two cores get data from L2 to L1 instruction cache simultaneously or would it be serialized? In other words, do we have multiple ports for L2 cache access for different cores?

If the two accesses are routed to different L2 banks (slices), the cache behaves like a multiported one and can handle both requests at the same time. But if both are routed to the same single-ported bank, they will be serialized by the cache. Serialization may cost several ticks while one request stalls at the port; the CPU will see this as slightly higher access latency.

Upvotes: 3
