ichigo

Reputation: 21

Which is faster, a struct, or a primitive variable containing the same bytes?

Here is an example piece of code:

#include <stdint.h> 
#include <iostream>

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int main() {
    //simply to make the answer unknowable at compile time
    uint16_t input;
    std::cin >> input;
    A a = {15,input};
    B b = 0x000f0000 + input;
    //a equals b
    int resultA = a.low-a.high;
    int resultB = b&0xffff - (b>>16)&0xffff;
    //use the variables so the optimiser doesn't get rid of everything
    return resultA+resultB;
}

Both resultA and resultB are supposed to calculate exactly the same thing, but which is faster (assuming the answer isn't known at compile time)?

I tried using Compiler Explorer to look at the output. With any optimisation level, no matter what I tried, the compiler outsmarted me and optimised the whole calculation away (at first it removed everything because the results weren't used). I then used cin to make the answer unknowable at compile time, but I still couldn't figure out how it was getting the answer at all (I think it still managed to work it out at compile time?).

Here is the output of Compiler Explorer with no optimisation flag:

        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     dword ptr [rbp - 4], 0
        movabs  rdi, offset std::cin
        lea     rsi, [rbp - 6]
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
        mov     word ptr [rbp - 16], 15
        mov     ax, word ptr [rbp - 6]
        mov     word ptr [rbp - 14], ax
        movzx   eax, word ptr [rbp - 6]
        add     eax, 983040
        mov     dword ptr [rbp - 20], eax
        # Begin calculating resultA
        movzx   eax, word ptr [rbp - 16]
        movzx   ecx, word ptr [rbp - 14]
        sub     eax, ecx
        mov     dword ptr [rbp - 24], eax
        # End of calculation
        # Begin calculating resultB
        mov     eax, dword ptr [rbp - 20]
        mov     edx, dword ptr [rbp - 20]
        shr     edx, 16
        mov     ecx, 65535
        sub     ecx, edx
        and     eax, ecx
        and     eax, 65535
        mov     dword ptr [rbp - 28], eax
        # End of calculation
        mov     eax, dword ptr [rbp - 24]
        add     eax, dword ptr [rbp - 28]
        add     rsp, 32
        pop     rbp
        ret

Here is the -O1 output as well, though I can't make sense of it (I'm quite new to low-level assembly).

main:                                   # @main
        push    rax
        lea     rsi, [rsp + 6]
        mov     edi, offset std::cin
        call    std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
        movzx   ecx, word ptr [rsp + 6]
        mov     eax, ecx
        and     eax, -16
        sub     eax, ecx
        add     eax, 15
        pop     rcx
        ret

Something to consider: while doing operations with the integer is slightly harder, simply accessing it as an integer is easier than with the struct (which you'd have to convert with bitshifts, I think?). Does this make a difference?

This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?

[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]

Upvotes: 2

Views: 163

Answers (3)

Mestkon

Reputation: 4061

Using GCC on Compiler Explorer, the version with the struct produces fewer instructions at -O3.

Code:

#include <stdint.h> 

typedef struct {
    uint16_t low;
    uint16_t high;
} __attribute__((packed)) A;

typedef uint32_t B;

int f1(A a)
{
    return a.low - a.high;
}

int f2(B b)
{
    return b&0xffff - (b>>16)&0xffff;
}

Assembly:

_Z2f11A:
    movzwl  %di, %eax
    shrl    $16, %edi
    subl    %edi, %eax
    ret
_Z2f2j:
    movl    %edi, %edx
    movl    $65535, %eax
    shrl    $16, %edx
    subl    %edx, %eax
    andl    %edi, %eax
    ret

But this is because the two functions don't do the same thing: - has higher precedence than &, so f2 actually computes b & (0xffff - (b>>16)) & 0xffff. When the B version is written to do the same thing as A, exactly the same assembly is produced.

Code:

int f3(B b)
{
    return (b&0xffff) - ((b>>16)&0xffff);
}

Assembly:

_Z2f3j:
    movzwl  %di, %eax
    shrl    $16, %edi
    subl    %edi, %eax
    ret

Note that the only way to find out if something is faster is to benchmark it in a real world use case.

Upvotes: 2

Yakk - Adam Nevraumont

Reputation: 275878

Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.

Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.

Which of the two results in faster code is not knowable without actually testing, or without a full description of the exact problem, so it is a toss-up.

Which means the only real concern is the ABI one: without whole-program optimization, a function taking a struct and a function taking an int with the same binary layout will make different assumptions about where the data is.

This only matters for by-value single arguments however.

The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.

Upvotes: 3

Mike Nakis

Reputation: 62130

When trying to answer questions of performance, examining unoptimized code is largely irrelevant.

As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.

Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know exactly how both a and b are derived from it, so it optimizes the code in ways that make it impossible to draw any useful conclusions from it.

As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where, generally, there is absolutely no performance difference between the two scenarios you are considering, or between any of the special cases you are pondering.

Upvotes: 2
