InvisibleShadowGhost

Reputation: 201

Why is transforming an array using AVX-512 instructions significantly slower when transforming it in batches of 8 compared to 7 or 9?

Please consider the following minimal example minimal.cpp (https://godbolt.org/z/x7dYes91M).

#include <immintrin.h>

#include <algorithm>
#include <ctime>
#include <iostream>
#include <numeric>
#include <vector>

#define NUMBER_OF_TUPLES 134'217'728UL

void transform(std::vector<int64_t>* input, std::vector<double>* output, size_t batch_size) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += batch_size) {
    size_t endOfBatch = std::min(startOfBatch + batch_size, NUMBER_OF_TUPLES);

    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      if (endOfBatch - idx >= 8) {
        auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);

        _mm512_storeu_epi64(&(*output)[idx], _converted);
        idx += 8;
      } else {
        (*output)[idx] = static_cast<double>((*input)[idx]);
        idx++;
      }
    }

    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}

void do_benchmark(size_t batch_size) {
  std::vector<int64_t> input(NUMBER_OF_TUPLES);
  std::vector<double> output(NUMBER_OF_TUPLES);

  std::iota(input.begin(), input.end(), 0);

  auto t = std::clock();
  transform(&input, &output, batch_size);
  auto elapsed = std::clock() - t;

  std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
}

int main() {
  do_benchmark(7UL);
  do_benchmark(8UL);
  do_benchmark(9UL);
}

It transforms the input array of int64_t into the output array of double in batches of the given batch_size. Whenever at least 8 tuples remain in the current batch, we use the following AVX-512 intrinsics to process 8 of them at once and thereby increase performance:

auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&(*output)[idx], _converted);

Otherwise, we fall back to the scalar implementation.

To make sure that the compiler doesn't collapse the two loops into one, we insert asm volatile("" : : "r,m"(output->data()) : "memory") after each batch, which forces the output data to be treated as flushed at that point.

It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using

clang++ -Wall -Wextra -march=cascadelake -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vnni -O3 minimal.cpp -o minimal

Executing the code, however, results in the following surprising output

Elapsed time for a batch size of 7: 204007
Elapsed time for a batch size of 8: 237600
Elapsed time for a batch size of 9: 209838

It shows that, for some reason, the code is significantly slower with a batch_size of 8, while batch sizes of 7 and 9 are both significantly faster.

This is surprising to me, because a batch size of 8 should be the perfect configuration: it only ever uses the AVX-512 instructions and always processes exactly 64 bytes (8 tuples of 8 bytes each, one full ZMM register) at a time. Why is this case so significantly slower?

Edit:

Added perf results for cache misses

Batch Size 7

 Performance counter stats for process id '653468':

     6,894,467,363      L1-dcache-loads                                               (44.43%)
     1,647,244,371      L1-dcache-load-misses     #   23.89% of all L1-dcache accesses  (44.43%)
     7,548,224,648      L1-dcache-stores                                              (44.43%)
         6,726,036      L2-loads                                                      (44.43%)
         3,766,847      L2-loads-misses           #   56.61% of all LL-cache accesses  (44.46%)
         6,171,407      L2-loads-stores                                               (44.45%)
         6,764,242      LLC-loads                                                     (44.46%)
         4,548,106      LLC-loads-misses          #   68.35% of all LL-cache accesses  (44.46%)
         6,954,088      LLC-loads-stores                                              (44.45%)

Batch Size 8

 Performance counter stats for process id '654880':

     1,009,889,247      L1-dcache-loads                                               (44.41%)
     1,413,152,123      L1-dcache-load-misses     #  139.93% of all L1-dcache accesses  (44.45%)
     1,528,453,525      L1-dcache-stores                                              (44.48%)
       158,053,929      L2-loads                                                      (44.51%)
       155,407,942      L2-loads-misses           #   98.18% of all LL-cache accesses  (44.50%)
       158,335,431      L2-loads-stores                                               (44.46%)
       158,349,901      LLC-loads                                                     (44.42%)
       155,902,630      LLC-loads-misses          #   98.49% of all LL-cache accesses  (44.39%)
       158,447,095      LLC-loads-stores                                              (44.39%)

      11.011153400 seconds time elapsed

Batch Size 9

 Performance counter stats for process id '656032':

     1,766,679,021      L1-dcache-loads                                               (44.38%)
     1,600,639,108      L1-dcache-load-misses     #   90.60% of all L1-dcache accesses  (44.42%)
     2,233,035,727      L1-dcache-stores                                              (44.46%)
       138,071,488      L2-loads                                                      (44.49%)
       136,132,162      L2-loads-misses           #   98.51% of all LL-cache accesses  (44.52%)
       138,020,805      L2-loads-stores                                               (44.49%)
       138,522,404      LLC-loads                                                     (44.45%)
       135,902,197      LLC-loads-misses          #   98.35% of all LL-cache accesses  (44.42%)
       138,122,462      LLC-loads-stores                                              (44.38%)

Upvotes: 7

Views: 1238

Answers (1)

Peter Cordes

Reputation: 365717

Update: testing (see comments) shows misalignment was not the explanation, and somehow aligning the arrays by 64 makes it slower. I wouldn't expect any 4k aliasing problem since we're loading and then storing, and large aligned allocations probably have the same alignment relative to a page boundary. i.e. are the same % 4096, probably 0. This is true even after simplifying the loops to not do so much branching with a short inner loop.


Your arrays are large and not aligned by 64, since you let std::vector<> allocate them. Using 64-byte vectors, every misaligned load will span a boundary between two 64-byte cache lines. (And you'll trip over the page split at the end of every 4k page, although that's rare enough in sequential access not to explain this.) That's unlike 32-byte loads/stores, where only every other vector is a cache-line split.

(Glibc's malloc / new for large allocations typically keeps the first 16 bytes for bookkeeping, so the address it returns is 16 bytes past the start of a page, always misaligned by 32 and 64, always creating the worst case.)
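
If you want to confirm what alignment you're actually getting, a quick standalone sketch along these lines (same allocation sizes as in the question, only inspecting the addresses) prints each buffer's offset within a cache line and within a 4k page:

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  // Same allocation pattern as the benchmark; we only look at where the data landed.
  std::vector<int64_t> input(134'217'728UL);
  std::vector<double> output(134'217'728UL);

  auto in = reinterpret_cast<uintptr_t>(input.data());
  auto out = reinterpret_cast<uintptr_t>(output.data());

  // Offset within a 64-byte cache line and within a 4096-byte page.
  std::cout << "input:  % 64 = " << in % 64 << ", % 4096 = " << in % 4096 << '\n';
  std::cout << "output: % 64 = " << out % 64 << ", % 4096 = " << out % 4096 << '\n';
}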

512-bit vectors (on Skylake/Cascade Lake at least) are known to slow down with misaligned 64-byte loads/stores (more than AVX1/2 code with misaligned 32-byte ops). Even when arrays are so large that you'd expect it to just bottleneck on DRAM bandwidth and have time to sort out any misalignment penalties inside the core while waiting for cache lines to arrive.

Single-core DRAM bandwidth on a big Xeon is pretty low vs. a "client" CPU, especially for Skylake-family. (The mesh interconnect was new in that generation, and it's lower than in Broadwell Xeon. Apparently Ice Lake Xeon made a big improvement to max per-core DRAM bandwidth.) So even scalar code is able to saturate memory bandwidth.

(Or perhaps batch=7 was auto-vectorizing with -mprefer-vector-width=256 after fully unrolling the inner loop? No, it wasn't even inlining your loop, and not unswitching that loop into while(full vector left) vector; / while(any left) scalar;, so you have pretty nasty asm that does a lot of branching for each vector and scalar.)
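
For reference, a hand-unswitched inner loop would look something like this (a sketch of a drop-in replacement for the inner loop of transform(); it is not what the compiler is currently emitting): a full-vector loop followed by a scalar tail, so the >= 8 test isn't re-branched on every element.

// Unswitched inner loop: run all the full vectors first, then a scalar tail.
size_t idx = startOfBatch;
for (; idx + 8 <= endOfBatch; idx += 8) {
  auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
  auto _converted = _mm512_cvtepu64_pd(_loaded);
  _mm512_storeu_epi64(&(*output)[idx], _converted);
}
for (; idx < endOfBatch; ++idx) {
  (*output)[idx] = static_cast<double>((*input)[idx]);
}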

For some reason, code that only ever uses 64-byte loads and stores can't max out one core's bandwidth. But your experiment shows that even a pattern of 1 vector + 1 scalar can help (batch=9), assuming that compiled to match the source.

I don't know why; maybe the load execution units run out of split buffers for handling loads that need data from two cache lines. (Perf event ld_blocks.no_sr). But the scalar loads don't need a split buffer entry because they're always naturally aligned (to 8 bytes). So they can execute if dispatched, maybe triggering fetch of cache lines sooner.
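
If you wanted to test that guess, counting that event for the three batch sizes would be a starting point, e.g. (same style of perf stat run as in your edit; <pid> is the running benchmark process):

perf stat -e ld_blocks.no_sr -p <pid>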

(HW prefetch doesn't work across 4k page boundaries where physical memory might be discontiguous; the L2 streamer only sees physical addresses. So a demand load into the next 4k page can get HW prefetch started early enough to max out DRAM bandwidth to L2, where maybe that wasn't happening if later split vector loads weren't happening. 4k boundaries apply even if using 2M transparent hugepages; the hardware prefetcher doesn't get told that the fetches are part of a contiguous hugepage.)

Batch=9 also makes one of every eight vectors aligned, which might help slightly.

These are wild guesses about microarchitectural causes, not backed up by any performance experiments to test these hypotheses.


Testing with aligned buffers

If you want to at least test whether misalignment is responsible for the whole thing, look into using a custom allocator for std::vector<int64_t, my_aligned_allocator> and/or std::vector<double, my_aligned_allocator>. (Modern approach to making std::vector allocate aligned memory). This is a good bet for production use, since it then works the same way as std::vector<int64_t>, although the second template parameter makes the two types not compatible with each other.
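
A minimal sketch of such an allocator (my_aligned_allocator is just the placeholder name used above; it relies on C++17 aligned operator new, so compile with -std=c++17 or later):

#include <cstddef>
#include <new>

// Minimal 64-byte-aligning allocator sketch for use as std::vector's 2nd template parameter.
template <class T>
struct my_aligned_allocator {
  using value_type = T;
  static constexpr std::size_t alignment = 64;  // one cache line / one ZMM vector

  my_aligned_allocator() = default;
  template <class U>
  my_aligned_allocator(const my_aligned_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{alignment}));
  }
  void deallocate(T* p, std::size_t) noexcept {
    ::operator delete(p, std::align_val_t{alignment});
  }
};

template <class T, class U>
bool operator==(const my_aligned_allocator<T>&, const my_aligned_allocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const my_aligned_allocator<T>&, const my_aligned_allocator<U>&) { return false; }

The rest of the benchmark stays the same apart from declaring the buffers as std::vector<int64_t, my_aligned_allocator<int64_t>> and std::vector<double, my_aligned_allocator<double>> (and changing transform()'s parameter types to match).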

For a quick experiment, make them std::vector<__m512i> and/or <__m512d> and change the loop code. (And compile with at least C++17 so the standard library respects alignof(T).) (Useful to see whether source or destination misalignment is the critical factor, or both.) For batch=8 you can directly loop over the vectors. In the general case you'll need to reinterpret_cast<char*>(src->data()) and do the appropriate pointer math if you want to test this way. GNU C might define the behaviour of pointing a double* into a __m512d because the latter happens to be defined in terms of double, but there are examples of pointing an int* at a __m256i not working as hoped. For a performance experiment, you can just check the asm and see if it's sane.
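
For the batch=8 case that could look something like this (a sketch assuming both array lengths are multiples of 8 elements, which 134'217'728 is; the function name is mine):

#include <immintrin.h>
#include <cstddef>
#include <vector>

// Quick-experiment sketch: both buffers hold 64-byte vector types, so every
// element is 64-byte aligned (given a C++17 library that respects alignof(T)).
void transform_aligned(std::vector<__m512i>* input, std::vector<__m512d>* output) {
  for (std::size_t i = 0; i < input->size(); ++i) {
    auto loaded = _mm512_load_si512(&(*input)[i]);    // aligned 64-byte load
    auto converted = _mm512_cvtepu64_pd(loaded);      // 8 x uint64 -> 8 x double
    _mm512_store_pd(reinterpret_cast<double*>(&(*output)[i]), converted);  // aligned store

    // One "batch" per vector, matching batch_size == 8 in the original.
    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}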

(Also you'd want to check that the compiler unrolled that inner loop, not actually branching inside a loop.)

Or use aligned_alloc to get raw storage instead of std::vector. But then you'd need to write to both arrays yourself to avoid page faults being part of the timed region for the first test, like std::vector's constructor does. (Idiomatic way of performance evaluation?) (std::vector is annoying when you don't want to write memory before your SIMD loop, since using .emplace_back is a pain with SIMD intrinsics. Not to mention that it sucks at growing, unable to use realloc in most C++ implementations to sometimes avoid having to copy.)
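
A sketch of that variant (transform_raw stands in for a hypothetical version of transform that takes raw pointers; std::aligned_alloc needs C++17, and the allocation size must be a multiple of the alignment, which it is here):

#include <cstdint>
#include <cstdlib>
#include <ctime>
#include <iostream>

void do_benchmark_raw(size_t batch_size) {
  // Raw 64-byte-aligned storage instead of std::vector.
  auto* input = static_cast<int64_t*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t)));
  auto* output = static_cast<double*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double)));

  // Touch both arrays up front so page faults happen before the timed region,
  // which is what the std::vector constructors did in the original benchmark.
  for (size_t i = 0; i < NUMBER_OF_TUPLES; ++i) {
    input[i] = static_cast<int64_t>(i);
    output[i] = 0.0;
  }

  auto t = std::clock();
  transform_raw(input, output, batch_size);  // hypothetical pointer-taking variant
  auto elapsed = std::clock() - t;

  std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;

  std::free(input);
  std::free(output);
}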

Or instead of writing an init loop or memset, do a warm-up pass? Good idea anyway for AVX-512 to make sure the 512-bit execution units are warmed up, and the CPU is at a frequency where it's able to run 512-bit FP instructions at the lowish throughput needed. (SIMD instructions lowering CPU frequency)
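
That can be as simple as one untimed call before the timed one, reusing the question's do_benchmark as a sketch:

  // Untimed warm-up pass: the pages were already written by the constructors, but this
  // also powers up the 512-bit execution units and lets the clock speed settle.
  transform(&input, &output, batch_size);

  auto t = std::clock();
  transform(&input, &output, batch_size);
  auto elapsed = std::clock() - t;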

(Maybe __attribute__((noinline,noipa)) on do_benchmark, although I don't think Clang knows GCC's noipa attribute = no inter-procedural analysis.)
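
For example (Clang and GCC both accept noinline; noipa would be GCC-only, so it's left out here):

__attribute__((noinline)) void do_benchmark(size_t batch_size) {
  // ... body unchanged from the question ...
}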

Upvotes: 1
