Cyan

Reputation: 13968

Poor _rotl performance under MinGW

I have a program whose performance depends heavily on the rotate-left instruction.

Under MSVC, it works fairly well: just define the _rotl() intrinsic as the target for rotate left.

Under GCC for Linux, it also works well. Here it is enough to define the equivalent software construction rotl32(x,r) = ((x << r) | (x >> (32 - r))); the compiler is clever enough to recognize this as a 32-bit rotate left and automatically replaces it with its intrinsic equivalent (to be fair, MSVC is also able to detect this pattern).
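
For reference, a minimal sketch of the software rotate described above, written so that r == 0 does not end up shifting by 32 (which is undefined behavior in C for a 32-bit operand). The masking with 31 and the unsigned parameter type are assumptions added here, not part of the original expression. GCC and Clang generally still recognize this form as a rotate, but it is worth checking the generated assembly for the compiler version in use.

```c
#include <stdint.h>

/* Rotate x left by r bits.
 * The count is reduced modulo 32 (an assumption added for safety),
 * so r == 0 and r >= 32 never produce an out-of-range shift. */
static inline uint32_t rotl32(uint32_t x, unsigned r)
{
    r &= 31u;
    return (x << r) | (x >> ((32u - r) & 31u));
}
```

With this definition, rotl32(0x12345678u, 8) yields 0x34567812u, and a count of 0 returns x unchanged.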

Under MinGW, not so much. This is all the more intriguing as MinGW uses GCC at its core. MinGW can compile the Windows intrinsic _rotl, but apparently without triggering the corresponding intrinsic. The software version also seems to go undetected, although, to be fair, it is nonetheless faster than _rotl. The end result is a 10x drop in performance, so it is definitely significant.

Note: the tested MinGW uses GCC 4.6.2.

Upvotes: 2

Views: 2099

Answers (2)

Nikita Kniazev

Reputation: 3785

Just include the intrin.h header.

It is a Windows-specific header, so if you are developing cross-platform software, do not forget to wrap it in a condition like this:

#ifdef _WIN32
# include <intrin.h>
#endif
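
Building on that guard, here is a sketch of a single helper that calls _rotl where intrin.h is available and falls back to the shift-and-or form elsewhere. The name portable_rotl32 is hypothetical, and the fallback branch is only an assumption about what a non-Windows build would want.

```c
#include <stdint.h>

#ifdef _WIN32
#  include <intrin.h>   /* makes _rotl available as a compiler intrinsic */
#endif

/* portable_rotl32 is a hypothetical name used for illustration. */
static inline uint32_t portable_rotl32(uint32_t x, int r)
{
#ifdef _WIN32
    return (uint32_t)_rotl(x, r);
#else
    /* Assumed fallback: masked shift-and-or, safe for r == 0. */
    unsigned s = (unsigned)r & 31u;
    return (x << s) | (x >> ((32u - s) & 31u));
#endif
}
```

Call sites then use portable_rotl32(value, 16) everywhere, and only this one function needs to change if another compiler requires special handling.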

Benchmark

Run on (4 X 3310 MHz CPU s)
09/07/16 23:29:35
Benchmark                    Time           CPU Iterations
----------------------------------------------------------
BM_rotl/8                   19 ns         18 ns   37392923
BM_rotl/64                 156 ns        149 ns    4487151
BM_rotl/512               1148 ns       1144 ns     641022
BM_rotl/4k                9286 ns       9178 ns      74786
BM_rotl/32k              71575 ns      69535 ns       8974
BM_rotl/256k            583148 ns     577204 ns       1000
BM_rotl/2M             4769689 ns    4830999 ns        155
BM_rotl/8M            19997537 ns   18720120 ns         35
BM_rotl_intrin/8             6 ns          6 ns  112178768
BM_rotl_intrin/64           55 ns         53 ns   14022346
BM_rotl_intrin/512         431 ns        407 ns    1725827
BM_rotl_intrin/4k         3327 ns       3338 ns     224358
BM_rotl_intrin/32k       27093 ns      26596 ns      26395
BM_rotl_intrin/256k     217633 ns     214167 ns       3205
BM_rotl_intrin/2M      1885492 ns    1853925 ns        345
BM_rotl_intrin/8M      8015337 ns    7626716 ns         90

Benchmark code

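#include <benchmark/benchmark.h>
#include <cstdint>   // uint32_t
#include <stdlib.h>  // declares _rotl() as an ordinary library function

// Defines a benchmark that rotates every element of an array by 16 bits.
// Which _rotl() is used depends on the headers included before the expansion.
#define MAKE_ROTL_BENCHMARK(name) \
  static void name(benchmark::State& state) { \
    auto arr = new uint32_t[state.range(0)](); /* zero-initialized */ \
    while (state.KeepRunning()) { \
      for (int i = 0; i < state.range(0); ++i) { \
        arr[i] = _rotl(arr[i], 16); \
      } \
    } \
    delete [] arr; \
  } \
  /**/

// Expanded before <intrin.h> is included: _rotl() is a plain function call.
MAKE_ROTL_BENCHMARK(BM_rotl)
#include <intrin.h>
// Expanded after <intrin.h> is included: _rotl() is the intrinsic.
MAKE_ROTL_BENCHMARK(BM_rotl_intrin)

#undef MAKE_ROTL_BENCHMARK

BENCHMARK(BM_rotl)->Range(8, 8<<20);
BENCHMARK(BM_rotl_intrin)->Range(8, 8<<20);

BENCHMARK_MAIN();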

Upvotes: 2

Joachim Isaksson

Reputation: 181077

Just in case you're stuck with the intrinsic on Windows, here's a way to do it using inline assembler on x86:

uint32_t rotl32_2(uint32_t x, uint8_t r) {
  asm("roll %1,%0" : "+r" (x) : "c" (r));
  return x;
}

Tested on Ubuntu's GCC, but it should work well on MinGW.

Upvotes: 3
