Dealing with data types smaller than the CPU data bus (C++ into machine code)

Question

In the following answer: https://softwareengineering.stackexchange.com/a/363379/370129

the way CPUs access memory is explained quite clearly. Assuming that we create a data type smaller than the CPU's data bus, as a C++ char is, how is the data-bus-sized chunk of memory read by the CPU modified so that it can be used in a register as the intended type? Is the specified byte shifted so that it occupies the least significant byte of the register, if it isn't there already? Are the excess (according to the type size) most significant bytes then set to 0?

Can the CPU then write one or more modified individual bytes to their memory addresses, or does it have to write the entire bus-size chunk back into each bus-size memory slot occupied by the bytes used?

old_timer · Accepted Answer

What happens is very specific to the processor, and some processors offer more than one choice. The common options for a load byte instruction are to either zero the upper bits or sign extend into the upper bits. So a load of 0xAB into a 32 bit register, if that is what your processor has, would give either 0x000000AB or 0xFFFFFFAB depending on the instruction. Some processors solve that problem in other ways.
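To make the two options concrete, here is a minimal C++ sketch of what a zero-extending and a sign-extending byte load leave in a 32 bit register. The function names are made up for illustration; a real processor does this in hardware as part of the load instruction.

```cpp
#include <cassert>
#include <cstdint>

// Zero extension: the upper 24 bits of the register become 0.
constexpr std::uint32_t zero_extend_byte(std::uint8_t b) {
    return static_cast<std::uint32_t>(b);
}

// Sign extension: the upper 24 bits become copies of bit 7 of the byte.
// (b ^ 0x80) - 0x80 maps 0x80..0xFF to -128..-1 without relying on
// implementation-defined narrowing conversions.
constexpr std::uint32_t sign_extend_byte(std::uint8_t b) {
    const std::int32_t v = (static_cast<std::int32_t>(b) ^ 0x80) - 0x80;
    return static_cast<std::uint32_t>(v);
}

// Usage: the 0xAB example from the text.
inline void demo_extend() {
    assert(zero_extend_byte(0xAB) == 0x000000ABu);
    assert(sign_extend_byte(0xAB) == 0xFFFFFFABu);
}
```

In C++ terms, this is roughly the difference between loading an unsigned char and a signed char into a wider integer.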

How the CPU bus works is determined by the CPU. Your processor is not going to be very successful if it doesn't have load/store byte (and load/store halfword, load/store word) instructions. But there are different ways to implement the bus, as can be seen with x86 and others that have evolved over time. For performance reasons you generally need 32, 64, or wider buses today; it is a balance, since too wide has big penalties and too narrow limits performance. It is not required, but often we have an L1 cache and sometimes an L2. The cache serves a number of purposes, but its SRAM has a fixed width, say 32 or 64 bits, so a write that is narrower than that requires a read-modify-write into that SRAM. That is not the processor's problem, nor is it a problem to be solved on the bus: the bus/memory controller captures the write information (address, data, and size) and then deals with the SRAM or the buses on the other side.

Reads are typically dealt with by the CPU. If you do an 8 bit read on a 32 or 64 bit bus, the result comes back on a byte lane defined by the bus, and the processor knows from the instruction how much data to take off of the bus, where on the bus it is, and what to do with it (send it straight into an ALU, into a register, sign or zero extend it, etc.).
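As a rough illustration of picking a byte off the bus, the sketch below assumes a 32 bit bus with little-endian byte lanes, where the low two address bits select the lane. That lane mapping and the function name are assumptions for the example; real buses define their own.

```cpp
#include <cassert>
#include <cstdint>

// Assumed model: 32-bit bus, four byte lanes, lane 0 in bits [7:0].
// The low two bits of the byte address select which lane holds the byte.
constexpr std::uint8_t extract_byte_lane(std::uint32_t bus_word,
                                         std::uint32_t byte_addr) {
    const unsigned lane = byte_addr & 0x3u;      // which of the 4 lanes
    return static_cast<std::uint8_t>(bus_word >> (8u * lane));
}

// Usage: a full-width read returned 0x44332211; pull out individual bytes.
inline void demo_extract() {
    assert(extract_byte_lane(0x44332211u, 0x1000u) == 0x11);
    assert(extract_byte_lane(0x44332211u, 0x1002u) == 0x33);
}
```

The extracted byte would then be zero or sign extended, depending on the instruction, before it lands in a register.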

Because the target side is often a cache or a peripheral designed for this bus, reads don't necessarily need to indicate a sub-bus size. Their lengths are often in units of bus width, so a 128 bit transfer on a 32 bit bus would have a length of 4. The overhead happens between buses, and then ideally a burst of four clocks would move the data (vs four 32 bit transfers with all the overhead per transfer). But a single or sub-width read would just show up as a single bus-width read, and the processor isolates the bytes of interest.
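The length arithmetic is just a round-up division. A small sketch, with a hypothetical helper name and assuming an aligned transfer:

```cpp
#include <cassert>

// Number of bus-width beats needed for a transfer, rounded up.
// Assumes the transfer is aligned to the bus width.
constexpr unsigned burst_length(unsigned transfer_bits, unsigned bus_bits) {
    return (transfer_bits + bus_bits - 1u) / bus_bits;
}

// Usage: the 128-bit example from the text, and a sub-width read.
inline void demo_burst() {
    assert(burst_length(128u, 32u) == 4u);  // one burst of four beats
    assert(burst_length(8u, 32u) == 1u);    // a byte read is still one beat
}
```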

For writes there is typically either a length indicator or a byte mask. If it is a 32 bit wide bus broken into 4 byte lanes, then there would be a 4 bit mask to tell the other side which bytes of a write are valid and which ones not to use/apply, and that would drive a read-modify-write as needed. If, for example, on an ARM you did an stm with three registers and your core is using a 64 bit wide bus, then that would show up as two transfers, one 32 bit and one 64 bit. If it uses a 32 bit wide bus, then it would likely be a single transfer with a length of three (although I have seen an ARM bus not do writes with a length of more than two bus widths).
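Here is a minimal sketch of how the far side might apply a byte mask with a read-modify-write. It assumes a 32 bit word with four byte lanes, where mask bit n enables lane n (bits [8n+7:8n]); the function name and that bit ordering are assumptions for illustration, not any particular bus specification.

```cpp
#include <cassert>
#include <cstdint>

// Merge the enabled byte lanes of write_data into old_word.
// byte_mask bit n == 1 means lane n of write_data is valid.
constexpr std::uint32_t apply_masked_write(std::uint32_t old_word,
                                           std::uint32_t write_data,
                                           std::uint8_t byte_mask) {
    std::uint32_t bit_mask = 0;
    for (unsigned lane = 0; lane < 4; ++lane) {
        if (byte_mask & (1u << lane)) {
            bit_mask |= 0xFFu << (8u * lane);   // expand to 8 bits per lane
        }
    }
    // Read (old_word), modify (merge), write (returned value).
    return (old_word & ~bit_mask) | (write_data & bit_mask);
}

// Usage: store the byte 0xAB into lane 2 of a word holding 0x44332211.
inline void demo_masked_write() {
    assert(apply_masked_write(0x44332211u, 0x00AB0000u, 0x4) == 0x44AB2211u);
}
```

The processor only has to put the byte on the right lane and set the mask; the merge happens in the memory controller or cache.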

There is always a penalty for the smaller transfers; it is a matter of whether you can see it or not, based on other processor/system overhead. On an x86 you can't necessarily see the penalty because of the overhead, but on an ARM you sometimes can, doing four byte-sized transfers vs one 32 bit transfer, or even two 16s vs one 32. But it depends: this doesn't automatically mean you will see it, it just means you might. And understand that ARM makes cores, not chips, so the bulk of the chip has nothing to do with ARM, and the overall performance has a lot to do with the chip vendor, not ARM.

Edit

Second attempt.

For writes, CPU buses typically support various sizes, as Fuz indicated. The CPU (processor core) these days does not have to deal with the read-modify-writes that happen on the far side.

For reads, the CPU bus generally reads a full bus width and the processor does have to deal with it. But the bus and processor are designed as a system. Depending on the instruction, the processor will extract the right number of bits and either zero pad or sign extend them.

This is all heavily processor/chip dependent.

I have seen, and it makes sense, that a single instruction can turn into multiple bus transactions, depending on the instruction, bus, address, and size.
