Reputation: 143

Interpreting bits in union fields as different datatypes in C/C++

I am trying to access Union bits as different datatypes. For example:

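    /* needs <stdio.h>, <stdint.h> and <inttypes.h> */
    typedef union {
        uint64_t x;
        uint32_t y[2];
    } test;

    test testdata;
    testdata.x = 0xa;
    /* PRIx64/PRIx32/PRIxPTR match the fixed-width types on any platform */
    printf("uint64_t: %016" PRIx64 "\nuint32_t: %08" PRIx32 " %08" PRIx32 "\n",
           testdata.x, testdata.y[0], testdata.y[1]);
    printf("Addresses:\nuint64_t: %016" PRIxPTR "\nuint32_t: %p %p\n",
           (uintptr_t)&testdata.x, (void *)&testdata.y[0], (void *)&testdata.y[1]);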

The output is

uint64_t: 000000000000000a
uint32_t: 0000000a 00000000
Addresses:
uint64_t: 00007ffe09d594e0
uint32_t: 0x7ffe09d594e0 0x7ffe09d594e4

The starting address of y is the same as the starting address of x. Since both fields use the same location, shouldn't the values of y be 00000000 0000000a?

Why is this not happening? How does the internal conversion work in a union whose fields have different datatypes?

What needs to be done to retrieve the exact raw bits as uint32_t in the same order as in uint64_t using a union?

Edit: As mentioned in the comments, C++ gives undefined behaviour. How does it work in C? Can we actually do it?

Upvotes: 3

Answers (1)

Serge Ballesta

Reputation: 148965

I will first explain what happens in your implementation.

You are doing type punning between a uint64_t value and an array of 2 uint32_t values. According to the result, your system is little endian and happily accepts that type punning by simply re-interpreting the byte representations. The byte representation of 0x0a as a little endian uint64_t is:

Byte number  0    1    2    3    4    5    6    7  
Value        0x0a 0x00 0x00 0x00 0x00 0x00 0x00 0x00

The least significant byte in little endian has the lowest address. Since y[0] occupies bytes 0 to 3 and y[1] occupies bytes 4 to 7, it is now evident why the uint32_t[2] representation is { 0x0a, 0x00 }.

But what you are doing is only legal in the C language.

C language:

C11 says in 6.5.2.3 Structure and union members:

3 A postfix expression followed by the . operator and an identifier designates a member of a structure or union object. The value is that of the named member,⁹⁵⁾ and is an lvalue if the first expression is an lvalue.

The ⁹⁵⁾ note says explicitly:

If the member used to read the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be a trap representation.

So even if notes are not normative, their intent is to make clear the way the standard should be interpreted: your code is valid and has defined behaviour on a little endian system defining uint64_t and uint32_t types.

C++ language:

C++ is more strict in that part. Draft n4659 for C++17 says in [basic.lval]:

8 If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:⁵⁶
(8.1) — the dynamic type of the object,
(8.2) — a cv-qualified version of the dynamic type of the object,
(8.3) — a type similar (as defined in 7.5) to the dynamic type of the object,
(8.4) — a type that is the signed or unsigned type corresponding to the dynamic type of the object,
(8.5) — a type that is the signed or unsigned type corresponding to a cv-qualified version of the dynamic type of the object,
(8.6) — an aggregate or union type that includes one of the aforementioned types among its elements or nonstatic data members (including, recursively, an element or non-static data member of a subaggregate or contained union),
(8.7) — a type that is a (possibly cv-qualified) base class type of the dynamic type of the object,
(8.8) — a char, unsigned char, or std::byte type.

And the note ⁵⁶ says explicitly:

The intent of this list is to specify those circumstances in which an object may or may not be aliased.

As punning is never referenced in the C++ standard and as the struct/union part does not contain the equivalent of the re-interpretation of C, reading in C++ the value of a member that is not the one that was last written invokes undefined behaviour.

Of course, common compiler implementations compile both C and C++, and most of them accept the C idiom even in C++ source, for the very same reason that the gcc C++ compiler gladly accepts VLAs in C++ source files. After all, undefined behaviour includes expected results... But you should not rely on that for portable code.

Upvotes: 6
