Reputation: 53
I've read a 2006 article about how CPUs do operations on whole L1 cache lines even in situations where you only need to do something with a small fraction of what the line contains (e.g. loading a whole L1 line just to write to a Boolean variable is obviously overkill). The article encouraged optimizing by managing memory in an L1-cache-friendly way.
Let's say I have two int variables that just happen to be consecutive in memory, and in my code I write to both of them consecutively.
Does the hardware consolidate my two code operations into one physical operation on a single L1 line (granted the CPU has an L1 cache line big enough to hold both variables), or not?
Is there some way to suggest such a thing to the CPU in C++ or C?
If the hardware doesn't do any such consolidation, then do you think it can yield better performance if such a thing is implemented in code? For example, allocating a memory block that is the size of the L1 line and populating it with as many hot-data variables as possible?
Upvotes: 4
Views: 181
Reputation: 364069
Yes, back-to-back writes to adjacent int32_t variables within a cache line can be coalesced in the store buffer of some CPUs, so they commit to L1d as a single 8-byte aligned update. (On many non-x86 CPUs, full 32-bit stores avoid an RMW cycle when updating L1d cache, so coalescing byte stores is good: Are there any modern CPUs where a cached byte store is actually slower than a word store? On Alpha 21264, even coalescing 32-bit stores into 64-bit commits was important.)
But coalescing in the store buffer only happens after multiple store instructions have executed separately. There aren't CPUs that fuse consecutive loads or stores into a single hardware operation for an execution unit.
Some compilers (e.g. GCC8 and later, IIRC) can coalesce loads/stores to adjacent struct members or local variables into a single asm instruction, e.g. storing 4 chars at once with one 32-bit store (or 2 ints on a 64-bit machine). On some ISAs like x86, it will do this even if alignment isn't known.
This does create a single asm operation that accesses multiple C objects. On ISAs with efficient unaligned loads/stores (like x86), it's very often a win. (Cache line splits are not common, and not too expensive. Splits across a 4k boundary are much more expensive on Intel before Skylake, though, like ~100 cycles.)
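As a minimal sketch of that (struct and function names invented here), two back-to-back assignments to adjacent members are the kind of thing a compiler like GCC 8 or later may merge into one 8-byte store:

    #include <cstdint>

    struct Pair {
        int32_t a;  // two adjacent 4-byte members: 8 contiguous bytes
        int32_t b;
    };

    void set_both(Pair &p) {
        // Two separate assignments in the source; with optimisation enabled,
        // the compiler may emit a single 8-byte store covering both members,
        // since they are adjacent and both values are known constants.
        p.a = 1;
        p.b = 2;
    }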
Using alignas(8) int foo; on a struct member to give the whole struct more alignment might enable this compile-time optimization on ISAs without efficient unaligned loads/stores.
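A sketch of what that looks like (member names invented); alignas on one member raises the alignment requirement of the whole struct:

    struct Flags {
        alignas(8) int first;   // over-aligning one member...
        int second;             // ...makes the whole struct 8-byte aligned
    };
    static_assert(alignof(Flags) == 8, "struct alignment raised to 8");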
I think ARM ldp/stp (load/store pair) is not bad in the not-fully-aligned case, but in the aligned case it can load or store a pair of registers as a single 64 or 128-bit operation.
Upvotes: 1
Reputation: 493
The whole point of caching is to allow a lot of highly localized memory operations to happen quickly.
The fastest operations involve registers, of course. The only delay in using them is in instruction fetching, decoding, and execution. In some register-rich architectures (and in vector processors), they're actually used like a specialized cache. And all but the slowest processors have one or more levels of cache that look like memory to ordinary instructions, except faster.
To simplify relative to actual processors, consider a hypothetical processor that runs at 2 GHz (0.5 ns per clock), with memory that takes 5 ns to load an arbitrary 64-bit (8-byte) word, but only 1 ns to load each successive 64-bit word. (Assume also that writes are similar.)
On such a machine, flipping a bit in memory is pretty slow: 1 ns to load the instruction (only if it's not already in the pipeline - but 5 ns after a distant branch), 5 ns to load the word containing the bit, 0.5 ns to execute the instruction, and 5 ns to write the changed word back to memory.
A memory copy is better: approximately zero to load instructions (since the pipeline presumably does the right thing with loops of instructions), 5 ns to load the first 8 bytes, 0.5 ns to execute an instruction, 5 ns to store the first 8 bytes, and 1+0.5+1 ns for each additional 8 bytes. Locality makes things easier.
But some operations can be pathological: incrementing each byte of an array does the initial 5 ns load, the 0.5 ns instruction, the initial 5 ns store, then 1+0.5+1 per byte (rather than per word) thereafter. (A memory copy that doesn't fall on the same word boundaries is also bad news.)
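To make the arithmetic concrete, here is a tiny sketch taking the hypothetical numbers above at face value (not any real CPU):

    // Copying N 64-bit words on the hypothetical machine costs:
    //   first word:      5 (load) + 0.5 (instruction) + 5 (store) = 10.5 ns
    //   each additional: 1 (load) + 0.5 (instruction) + 1 (store) =  2.5 ns
    constexpr double copy_ns(int words) {
        return 10.5 + 2.5 * (words - 1);
    }
    static_assert(copy_ns(1) == 10.5, "a single 8-byte word");
    static_assert(copy_ns(8) == 28.0, "a 64-byte block");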
To make this processor faster, we can add a cache that improves loads and stores to just 0.5 ns over the instruction execution time, for data that's in cache. The memory copy doesn't improve on reading, since it still costs 5 ns for the first 8-byte word and 1 ns for each additional word, but the writes get much faster: 0.5 ns for every word until the cache fills, and at the normal 5+1+1, etc. rate after it fills, in parallel with other work that uses memory less. The byte increments improve to 5 ns for the initial load, 0.5+0.5 ns for the instruction and write, then 0.5+0.5+0.5 ns for each additional byte, except during cache stalls on reads or writes. More repetition of the same few addresses increases the proportion of cache hits.
What happens with real processors, multiple levels of cache, etc.? The simple answer is that things get more complicated. Writing cache-aware code includes trying to improve locality of memory access, analysis to avoid thrashing the cache, and a whole lot of profiling.
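As a small illustration of the locality point (the array size here is arbitrary), the two functions below compute the same sum, but the first walks memory in the order it is laid out while the second strides across it and misses the cache far more often:

    #include <cstddef>

    constexpr std::size_t N = 1024;
    static float grid[N][N];   // row-major: grid[r][0..N-1] are contiguous

    float sum_row_major() {    // cache-friendly: walks memory sequentially
        float total = 0.0f;
        for (std::size_t r = 0; r < N; ++r)
            for (std::size_t c = 0; c < N; ++c)
                total += grid[r][c];
        return total;
    }

    float sum_column_major() { // same result, but each access strides
        float total = 0.0f;    // N*sizeof(float) bytes, so far more misses
        for (std::size_t c = 0; c < N; ++c)
            for (std::size_t r = 0; r < N; ++r)
                total += grid[r][c];
        return total;
    }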
Upvotes: 2
Reputation: 31851
The size of the cache line is primarily relevant for concurrency. It is the smallest block of data that can be synchronized between multiple processors.
It is also, as you suggest, necessary to load the whole cache line to do an operation on just a few bytes of it. If you do multiple operations on the same processor, though, it won't need to continually reload it; it is actually cached, as the name suggests. This includes caching writes to the data. So long as only one processor is accessing the data, you can usually rest assured that it is handling it efficiently.
In cases where multiple processors access the data, it may be helpful to align the data. Using the C++ alignas specifier, or compiler extensions, can help you get data structures that are aligned the way you want.
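For example (assuming 64-byte cache lines, which is typical of current x86 and many ARM cores), over-aligning per-thread data keeps two counters from sharing a line, avoiding false sharing:

    #include <atomic>

    // Assuming 64-byte cache lines: alignas(64) pads each counter out to its
    // own line, so two threads incrementing their own counter never contend
    // for the same cache line.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter per_thread_counter[2];  // one element per worker thread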
You may be interested in my article CPU Reordering – What is actually being reordered? which gives a few insights into what happens (at least logically) at the low level.
Upvotes: 5
Reputation: 129314
It is quite a broad question, but I will try to cover the main points.
Yes, reading data into the cache ONLY to look at a single bool is a bit wasteful - however, the processor typically doesn't KNOW what you are planning to do after that, for example whether you need the next consecutive value or not. You can rely on data in the same class or struct being located next to/near each other, so storing data that you often operate on together close to each other will give you that benefit.
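A minimal sketch of that idea (the type and field names are invented): fields that are updated together sit next to each other, so one cache-line fill covers all of them.

    struct Particle {
        // These fields are typically updated together each step, so keeping
        // them in one struct means a single cache-line fill covers them all.
        float x, y, z;     // position
        float vx, vy, vz;  // velocity
    };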
As for operating on "more than one piece of data at once", most modern processors have various forms of extensions to do the same operation on multiple data items (SIMD - single instruction, multiple data). This started with MMX in the late 1990s, and has been extended to include 3DNow!, SSE and AVX for x86. In ARM there is the "Neon" extension, which also provides similar functionality. PowerPC also has something similar whose name escapes me at the moment.
There is no way for a C or C++ program to directly control the choice of instructions or cache usage. But modern compilers, given the right options, will produce code that for example uses SIMD instructions to sum all the ints in a larger array by adding 4 items at a time and then, when the whole lot has been done, horizontally adding the 4 values. Or if you have a set of X, Y, Z coordinates, it may well use SIMD to add two sets of such data together. It is the compiler's choice to do this, but it is something that can save quite a bit of time, so the optimisers in compilers are being modified to find cases where this helps and to use these types of instructions.
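For instance, a plain reduction loop like the sketch below (function name invented) is the kind of code that an optimising compiler, given the right options (e.g. -O3), may turn into SIMD additions plus a final horizontal add:

    #include <cstddef>

    int sum_array(const int *data, std::size_t count) {
        int total = 0;
        // A straightforward scalar loop: with vectorisation enabled, the
        // compiler may add 4 (or 8) ints per instruction and combine the
        // partial sums with a horizontal add after the loop.
        for (std::size_t i = 0; i < count; ++i)
            total += data[i];
        return total;
    }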
And finally, most larger modern processors (x86 since ca 1995, ARM A15, PowerPC) also do superscalar execution - executing more than one instruction at a time - along with "out of order execution" (the processor understands the dependencies between instructions and executes ones that are "ready" to execute, not necessarily in the exact order they were given to the processor). The compiler will know this and try to "help" arrange the code so that the processor gets an easy task.
Upvotes: 3