Reputation: 97008
Suppose you want to convert a uint64_t to a uint8_t[8] (little endian). On a little-endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:
#include <cstdint>
#include <cstring>

void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}
This generates efficient assembly:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
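For comparison, the reinterpret_cast<> variant mentioned above might look like the sketch below (function name is mine); unlike memcpy(), that store is technically undefined behaviour under the strict aliasing rules, which is part of why it's ugly:

void from_cast(const std::uint64_t &x, uint8_t* bytes) {
    // Technically UB: stores a uint64_t into storage declared as uint8_t
    *reinterpret_cast<std::uint64_t*>(bytes) = x;
}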
However, neither is portable: the result will differ on a big-endian machine.
For converting uint8_t[8] to uint64_t there is a great solution - just do this:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}
This looks inefficient, but actually with Clang -O2 it generates exactly the same assembly as before, and when the byte order you spell out doesn't match the target's native order, it is smart enough to use a native byte-swap instruction. E.g. this big-endian version:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[7]) << 8*0) |
        (std::uint64_t(bytes[6]) << 8*1) |
        (std::uint64_t(bytes[5]) << 8*2) |
        (std::uint64_t(bytes[4]) << 8*3) |
        (std::uint64_t(bytes[3]) << 8*4) |
        (std::uint64_t(bytes[2]) << 8*5) |
        (std::uint64_t(bytes[1]) << 8*6) |
        (std::uint64_t(bytes[0]) << 8*7);
}
Compiles to:
mov rax, qword ptr [rdi]
bswap rax
mov qword ptr [rsi], rax
ret
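As a quick sanity check of the byte order, the big-endian version just shown reads the first byte as the most significant (values are illustrative):

const std::uint8_t buf[8] = {0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef};
std::uint64_t v;
to(buf, v);  // v == 0x0123456789abcdef on any platform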
My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:
void from(const std::uint64_t &x, uint8_t* bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}
Edit: After some experimentation, this code does get compiled optimally by GCC 8.1 and later as long as you use uint8_t* __restrict__ bytes. However, I still haven't managed to find a form that Clang will optimise.
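For reference, the restrict-qualified version looks like the following sketch (__restrict__ is a GCC/Clang extension, not standard C++):

void from(const std::uint64_t &x, uint8_t* __restrict__ bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}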
Upvotes: 14
Views: 3945
Reputation: 1255
What about returning a value? Easy to reason about and small assembly:
#include <cstdint>
#include <array>

auto to_bytes(std::uint64_t x)
{
    std::array<std::uint8_t, 8> b;
    b[0] = x >> 8*0;
    b[1] = x >> 8*1;
    b[2] = x >> 8*2;
    b[3] = x >> 8*3;
    b[4] = x >> 8*4;
    b[5] = x >> 8*5;
    b[6] = x >> 8*6;
    b[7] = x >> 8*7;
    return b;
}
and for the big-endian target:
#include <stdint.h>

struct mybytearray
{
    uint8_t bytes[8];
};

auto to_bytes(uint64_t x)
{
    mybytearray b;
    b.bytes[0] = x >> 8*0;
    b.bytes[1] = x >> 8*1;
    b.bytes[2] = x >> 8*2;
    b.bytes[3] = x >> 8*3;
    b.bytes[4] = x >> 8*4;
    b.bytes[5] = x >> 8*5;
    b.bytes[6] = x >> 8*6;
    b.bytes[7] = x >> 8*7;
    return b;
}
(std::array not available for -target aarch64_be?)
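A quick usage sketch with the std::array version (the value is illustrative):

auto b = to_bytes(0x0123456789abcdefULL);
// b[0] == 0xef on any platform, because the shift amounts define the byte order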
Upvotes: 3
Reputation: 249552
The code you've given is way overcomplicated. You can replace it with:
#include <endian.h>   // htole64() on Linux
#include <cstdint>
#include <cstring>
void from(uint64_t x, uint8_t* dest) {
    x = htole64(x);
    std::memcpy(dest, &x, sizeof(x));
}
Yes, this uses the Linux-ism htole64(), but if you're on another platform you can easily reimplement that (see the sketch below).
Clang and GCC optimize this perfectly, on both little- and big-endian platforms.
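For example, here is a minimal reimplementation sketch for GCC and Clang, relying on their predefined endianness macros and __builtin_bswap64 (this helper is my addition, not part of the original answer):

#include <cstdint>

inline std::uint64_t my_htole64(std::uint64_t x) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap64(x);  // big-endian host: swap into little-endian order
#else
    return x;                     // little-endian host: already in order
#endif
}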
Upvotes: 0
Reputation: 6012
First of all, the reason why your original from implementation cannot be optimized is that you are passing the arguments by reference and pointer, so the compiler has to consider the possibility that both of them point to the very same address (or at least that they overlap). Since you have 8 consecutive read and write operations on the (potentially) same address, the as-if rule cannot be applied here.
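Concretely, the compiler has to allow for a caller doing something like this contrived sketch, where bytes aliases x:

std::uint64_t v = 42;
from(v, reinterpret_cast<std::uint8_t*>(&v));  // bytes points into x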
Note that just by removing the & from the function signature, GCC apparently already takes this as proof that bytes does not point into x, and thus the function can safely be optimized. However, for Clang this is not good enough.
Technically, of course, bytes can point into from's stack memory (i.e. at x), but I think that would be undefined behavior, so Clang just misses this optimization.
Your implementation of to doesn't suffer from this issue, because you implemented it in such a way that all the values of bytes are read first and then one big assignment is made to x. So even if x and bytes point to the same address, since all the reading happens before all the writing (instead of mixing reads and writes as in from), this can be optimized.
Flávio Toribio's answer works because it does precisely this: it reads all the values first and only then writes to the destination.
However, there are less complicated ways to achieve this:
#include <cstdint>
#include <cstring>

void from(uint64_t x, uint8_t* dest) {
    uint8_t bytes[8];
    bytes[7] = uint8_t(x >> 8*7);
    bytes[6] = uint8_t(x >> 8*6);
    bytes[5] = uint8_t(x >> 8*5);
    bytes[4] = uint8_t(x >> 8*4);
    bytes[3] = uint8_t(x >> 8*3);
    bytes[2] = uint8_t(x >> 8*2);
    bytes[1] = uint8_t(x >> 8*1);
    bytes[0] = uint8_t(x >> 8*0);
    // memcpy avoids the strict-aliasing UB of *(uint64_t*)dest = *(uint64_t*)bytes
    // and compiles to the same code.
    std::memcpy(dest, bytes, sizeof(bytes));
}
gets compiled to
mov qword ptr [rsi], rdi
ret
on little endian and to
rev x8, x0
str x8, [x1]
ret
on big endian.
Note that even if you passed x by reference, Clang would still be able to optimize this. However, that would result in one more instruction in each case:
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
and
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret
respectively.
Also note that you can improve your implementation of to with a similar trick: instead of passing the result by non-const reference, take the "more natural" approach and just return it from the function:
uint64_t to(const uint8_t* bytes) {
    return
        (uint64_t(bytes[7]) << 8*7) |
        (uint64_t(bytes[6]) << 8*6) |
        (uint64_t(bytes[5]) << 8*5) |
        (uint64_t(bytes[4]) << 8*4) |
        (uint64_t(bytes[3]) << 8*3) |
        (uint64_t(bytes[2]) << 8*2) |
        (uint64_t(bytes[1]) << 8*1) |
        (uint64_t(bytes[0]) << 8*0);
}
Here are the best solutions I could get, for both little endian and big endian. Note how to and from are truly inverse operations that can be optimized to a no-op if executed one after another.
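To illustrate that last point, here is a round-trip sketch (assuming the from and to definitions above); Clang folds it down to just returning x:

uint64_t round_trip(uint64_t x) {
    uint8_t buf[8];
    from(x, buf);    // encode to little-endian bytes
    return to(buf);  // decode again; the pair optimizes to a no-op
}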
Upvotes: 2
Reputation: 4078
Here's what I could test based on the discussion in OP's comments:
#include <cstdint>
#include <cstring>

void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
    std::uint64_t big;
    std::uint8_t* temp = (std::uint8_t*)&big;
    temp[0] = x >> 8*0;
    temp[1] = x >> 8*1;
    temp[2] = x >> 8*2;
    temp[3] = x >> 8*3;
    temp[4] = x >> 8*4;
    temp[5] = x >> 8*5;
    temp[6] = x >> 8*6;
    temp[7] = x >> 8*7;
    // Copy out with memcpy instead of *(std::uint64_t*)bytes = big,
    // which would violate strict aliasing; the generated code is identical.
    std::memcpy(bytes, &big, sizeof(big));
}
Looks like this gives the compiler enough information to make the assumptions it needs to optimize the function (both on GCC and Clang with -O2).
Compiling to x86-64 (little endian) on Clang 8.0.0 (test on Godbolt):
mov rax, qword ptr [rdi]
mov qword ptr [rsi], rax
ret
Compiling to aarch64_be (big endian) on Clang 8.0.0 (test on Godbolt):
ldr x8, [x0]
rev x8, x8
str x8, [x1]
ret
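For completeness, a small usage sketch (the value and buffer names are illustrative); because the byte writes spell out little-endian order, the result is the same on either target:

std::uint64_t value = 0x0123456789abcdef;
std::uint8_t out[8];
from_optimized(value, out);  // out now holds the little-endian encoding of value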
Upvotes: 3