MuhminA

Reputation: 115

Reordering bytes in memory, then writing to file

I have a block of data in memory starting at memory_ptr. The program byte-reverses every qword and writes the result to a file. For example, 18171615141312112827262524232221 will be written as 11121314151617182122232425262728, and so on.

I want this to run as fast as possible, so I store the flipped bytes in memory and write them to the file once the buffer reaches 8 KB, to reduce the number of calls to fwrite(). However, I feel like there is a simpler and faster way of doing this than increasing the size of the malloc to further reduce calls to fwrite. Any ideas on how to speed this up?

#define htonll(x) (((uint64_t)htonl((x) & 0xFFFFFFFF) << 32) | htonl((x) >> 32))

uint64_t data;
int total_chunks = 1000;
int chunk_size = 8192;
char *tmp = malloc(8192);
FILE *fp = fopen("data.bin","wb");

for (int i = 0; i < total_chunks; i++) {
    for (int j = 0; j < chunk_size; j+=8) {
        memcpy(&data, memory_ptr, 8);
        data = htonll(data);
        memcpy(tmp+j, &data, 8);
        memory_ptr+=8;
    }
    fwrite(tmp, 8192, 1, fp);
    memset(tmp,0,8192);
}

Upvotes: 3

Views: 171

Answers (1)

Jérôme Richard

Reputation: 50488

TL;DR: tune your compiler flags (or use SIMD intrinsics/libraries if your compiler is not clever enough) and use memory-mapped files. If this is not enough, you can use multiple threads with OpenMP and lower-level APIs that avoid many overheads, at the expense of less portable and more complex code.

Vectorization using SIMD instructions

Your current code is already quite good, assuming the compiler you use is tuned to generate fast code and its heuristics are good enough to do so. Note that memcpy and memset calls with a small fixed size are optimized by compilers (at least by Clang, GCC and ICC). Clang generates nearly optimal code for modern x86-64 processors with the flags -O3 -mavx2, which assume the target machine has the AVX2 instruction set (most x86 processors have it, but not all, especially old ones). Here is the generated assembly of the hot loop:

.LBB0_7:
        vmovdqu ymm0, ymmword ptr [r14 + 8*rcx]
        vmovdqu ymm1, ymmword ptr [r14 + 8*rcx + 32]
        vmovdqu ymm2, ymmword ptr [r14 + 8*rcx + 64]
        vmovdqu ymm3, ymmword ptr [r14 + 8*rcx + 96]
        vpshufb ymm0, ymm0, ymm4
        vpshufb ymm1, ymm1, ymm4
        vpshufb ymm2, ymm2, ymm4
        vpshufb ymm3, ymm3, ymm4
        vmovdqu ymmword ptr [rbx + 8*rcx], ymm0
        vmovdqu ymmword ptr [rbx + 8*rcx + 32], ymm1
        vmovdqu ymmword ptr [rbx + 8*rcx + 64], ymm2
        vmovdqu ymmword ptr [rbx + 8*rcx + 96], ymm3
        add     rcx, 16
        cmp     rcx, 1024
        jne     .LBB0_7

This code shuffles 128 bytes at once in only a few cycles (only 4 cycles on my Intel Skylake processor)!

If you are unsure about using the AVX2 instruction set, do not worry: compilers like GCC and Clang already generate quite good code with the AVX instruction set (which is available on almost all modern x86-64 processors). You can see that on Godbolt. The flag -mavx is enough to generate fast code (for Clang/GCC/ICC). For MSVC you need to use the flags /arch:AVX and /arch:AVX2 respectively. If you want to support nearly all x86-64 processors at the expense of slower code, you can use the -mssse3 flag to use SSE instead of AVX.

Note about the support of SIMD instruction sets: the Steam hardware survey reports that AVX2 is supported by 87% of its users, AVX by 95% and SSSE3 by 99.3%.

Note that the same applies to other hardware architectures: you mainly just need to enable the right flags for your compiler.


Additional optimizations

The final memset(tmp,0,8192); call is not needed, since every byte of tmp is overwritten on the next iteration anyway. You can remove it.

fwrite is generally pretty fast because the libc uses its own (quite large) buffer internally, and that buffer should be reasonably well adapted to your hardware. However, the data buffers passed to the operating system (OS) need to be copied, and this copy is not done efficiently by most kernel implementations (e.g. Linux, for complex technical reasons related to the use of SIMD instructions). This copy is generally not a performance problem on hard disk drives (HDDs). However, it can introduce an overhead on fast solid-state drives (SSDs), especially high-throughput NVMe ones capable of writing several GiB/s. One way to speed this up is to use memory-mapped files (see here, here and there for more information).
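A minimal POSIX sketch of the memory-mapped approach could look like this (the helper name write_swapped_mmap is made up for illustration; __builtin_bswap64 is a GCC/Clang builtin, error handling is kept terse, and size is assumed to be a multiple of 8):

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Write the byte-reversed qwords of src directly into a memory-mapped
// file, skipping the stdio buffer. Hypothetical helper; POSIX-only.
static int write_swapped_mmap(const char *path, const uint8_t *src, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)size) != 0) {  // grow the file to its final size
        close(fd);
        return -1;
    }
    uint8_t *dst = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (dst == MAP_FAILED) {
        close(fd);
        return -1;
    }
    for (size_t i = 0; i < size; i += 8) {
        uint64_t v;
        memcpy(&v, src + i, 8);
        v = __builtin_bswap64(v);  // compiles to a single bswap (or SIMD shuffle)
        memcpy(dst + i, &v, 8);
    }
    munmap(dst, size);  // flushes the dirty pages back to the file
    close(fd);
    return 0;
}
```

The kernel writes the dirty pages back lazily, so the explicit copy into the stdio buffer and the copy from it into the page cache both disappear.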

If you have a very fast SSD and using memory-mapped files is not enough, then you can try to parallelize the loop using OpenMP. This should be very simple in your case since the loop is embarrassingly parallel: just add a #pragma omp parallel for. Note that this solution can be slower on some slow SSDs and will certainly be slower on HDDs (which like neither non-sequential accesses nor parallel ones).
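The OpenMP version of the swap loop can be sketched like this (the helper name bswap_buffer is made up for illustration; compile with -fopenmp, and without it the pragma is simply ignored and the loop runs serially):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Byte-reverse every qword of src into dst in parallel.
// Hypothetical helper; size must be a multiple of 8.
static void bswap_buffer(const uint8_t *src, uint8_t *dst, size_t size)
{
    // Embarrassingly parallel: each iteration touches a distinct qword,
    // so no synchronization is needed between threads.
    #pragma omp parallel for
    for (long i = 0; i < (long)size; i += 8) {
        uint64_t v;
        memcpy(&v, src + i, 8);
        v = __builtin_bswap64(v);  // GCC/Clang builtin
        memcpy(dst + i, &v, 8);
    }
}
```

Parallelizing only pays off when the swap (or the copy into a mapped file) is the bottleneck rather than the storage device itself.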

If this is still not enough, you can try experimental solutions like liburing, which uses the new io_uring kernel interface available only on Linux (Windows seems to have a similar interface called IoRing). This interface is designed to avoid copies (i.e. zero-copy) and system calls (no system calls in hot loops), resulting in very fast code with almost no overhead. However, using it directly will make your code less portable and much more complex. The above methods should probably be enough for you.

Upvotes: 4
