SIMEL

Reputation: 8931

Does it take more time to access a value in a pointed struct than a local value?

I have a struct which holds values that are used as the arguments of a for loop:

struct ARGS {
    int endValue;
    int step;
    int initValue;
};

ARGS * arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < arg->endValue; i+=arg->step) {
    //...
}

Since the values of endValue and step are checked on each iteration, would it be faster if I moved them into local variables before using them in the for loop?

initValue = arg->initValue;
endValue = arg->endValue;
step = arg->step;

for (int i = initValue; i < endValue; i+=step) {
    //...
}

Upvotes: 2

Views: 314

Answers (3)

The clear-cut answer is that in 99.9% of cases it does not matter, and you should not be concerned with it. There might be micro-level differences, but they won't matter to almost anyone. The gory details depend on the architecture and the optimizer, but bear with me: with very, very high probability there is no difference.

// case 1
ARGS * arg = ...; //get a pointer to an initialized struct
for (int i = arg->initValue; i < arg->endValue; i+=arg->step) {
    //...
}

// case 2
initValue = arg->initValue;
endValue = arg->endValue;
step = arg->step;

for (int i = initValue; i < endValue; i+=step) {
    //...
}

In the case of initValue, there will not be a difference. The value will be loaded through the pointer and stored into the initValue variable, just to store it in i. Chances are that the optimizer will skip initValue and write directly to i.

The case of step is a bit more interesting, in that the compiler can prove that the local variable step is not shared with any other thread and can only change locally. If register pressure is low, it can keep step in a register and never touch memory for it again. On the other hand, it cannot assume that arg->step is not changed by external means, and it is required to go to memory to read the value on each iteration. Understand that memory here most probably means the L1 cache. An L1 cache hit on a Core i7 takes approximately 4 CPU cycles, which on a 2 GHz processor is roughly 2 * 10^-9 seconds. And that is under the assumption that the compiler can keep step in a register in the second version, which may not be the case. If step cannot be held in a register, you pay for the access to memory (cache) in both cases.

Write code that is easy to understand, then measure. If it is slow, profile and figure out where the time is really spent. Chances are that this is not the place where you are wasting cpu cycles.

Upvotes: 4

6502

Reputation: 114461

The problem is that the two versions are not identical. If the code in the ... part modifies the values in arg, then the two options will behave differently (the "optimized" one will keep using the original step and end values, not the updated ones).

If the optimizer can prove by looking at the code that this cannot happen, then the performance will be the same, because hoisting loads out of loops is a common optimization today. However, it's quite possible that something in ... could change the contents of the structure, and in that case the optimizer must be paranoid: the generated code will reload the values from the structure at each iteration. How costly that is depends on the processor.

For example, if the arg pointer is received as a parameter and the code in ... calls any external function whose code is unknown to the compiler (including things like malloc), then the compiler must assume that the external code may know the address of the structure and may change the end or step values. The optimizer is therefore forbidden to move those loads out of the loop, because doing so would change the behavior of the code.

Even if it's obvious to you that malloc is not going to change the contents of your structure, this is not obvious at all to the compiler, for which malloc is just an external function that will be linked in at a later step.

Upvotes: 1

rjp

Reputation: 1820

This depends on your architecture. Whether it is a RISC or a CISC processor affects how memory is accessed, and on top of that the available addressing modes affect it as well.

On the ARM code I work with, the base address of a structure is typically moved into a register, and the member is then loaded from that address plus an offset. To access a plain variable, the address of the variable is moved into a register, and the load is executed without an offset. In that case the two accesses take the same amount of time.

Here's what the example assembly might look like on ARM for accessing the second int member of a structure, compared to directly accessing a variable.

ldr r0, =MyStruct        ; struct { int x; int y; } MyStruct
ldr r0, [r0, #4]         ; load MyStruct.y into r0

ldr r1, =MyIntY          ; int MyIntX, MyIntY
ldr r1, [r1]             ; directly load MyIntY into r1

If your architecture does not allow addressing with offsets, then it would need to move the address into a register and then perform the addition of the offset.

Additionally, since you've tagged this as C++ as well, if you overload the -> operator for the type, then this will invoke your own code which could take longer.

Upvotes: 2
