Reputation: 13968
I have a program whose performance depends heavily on the rotate-left instruction.
Under MSVC this works fairly well: just use the _rotl() intrinsic as the target for rotate left.
Under GCC for Linux it also works well. There it is enough to define the equivalent software construction rotl32(x,r) = ((x << r) | (x >> (32 - r))). The compiler is clever enough to recognize this as a 32-bit rotate left and automatically replaces it with the rotate instruction (to be fair, MSVC is also able to detect this pattern).
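For reference, here is a minimal sketch of that construction written as a C function. The masking of the count is an addition that is not in the macro above: with r == 0 the plain form shifts right by 32, which is undefined behavior in C. The name rotl32_safe is only illustrative.

```c
#include <stdint.h>

/* Rotate x left by r bits (sketch; rotl32_safe is a hypothetical name).
 * Masking keeps both shift counts in the range 0..31, so r == 0 does not
 * trigger the undefined shift by 32 that (x >> (32 - r)) would. */
static inline uint32_t rotl32_safe(uint32_t x, unsigned r)
{
    r &= 31;
    return (x << r) | (x >> ((32 - r) & 31));
}
```

Whether a given compiler turns this masked form into a single rotate instruction is worth confirming in the generated assembly, since pattern recognition varies between compiler versions.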
Under MinGW, not so much. This is all the more intriguing because MinGW uses GCC at its core. MinGW can compile the Windows intrinsic _rotl, but apparently without emitting the corresponding rotate instruction. The software version also seems to go undetected although, to be fair, it is still faster than _rotl. The end result is a 10x reduction in performance, so it is definitely significant.
Note: the GCC version of the tested MinGW is 4.6.2.
Upvotes: 2
Views: 2099
Reputation: 3785
Just include the intrin.h header.
It is a Windows-specific header, so if you are developing cross-platform software, do not forget to wrap it in a condition like this:
#ifdef _WIN32
# include <intrin.h>
#endif
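Building on that guard, a rotate helper could pick _rotl on Windows and fall back to the shift construction elsewhere. This is only a sketch: rotl32_portable is a made-up name, and the fallback branch is an assumption about what a non-Windows build would want.

```c
#include <stdint.h>

#ifdef _WIN32
#  include <intrin.h>   /* lets the compiler treat _rotl as an intrinsic */
#endif

/* rotl32_portable is a hypothetical wrapper name, not an existing API. */
static inline uint32_t rotl32_portable(uint32_t x, int r)
{
#ifdef _WIN32
    return _rotl(x, r);
#else
    /* Shift-based fallback; the mask keeps both counts in 0..31. */
    unsigned n = (unsigned)r & 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
#endif
}
```

Call sites then use rotl32_portable(x, 16) everywhere and never mention the platform.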
Run on (4 X 3310 MHz CPU s)
09/07/16 23:29:35
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_rotl/8 19 ns 18 ns 37392923
BM_rotl/64 156 ns 149 ns 4487151
BM_rotl/512 1148 ns 1144 ns 641022
BM_rotl/4k 9286 ns 9178 ns 74786
BM_rotl/32k 71575 ns 69535 ns 8974
BM_rotl/256k 583148 ns 577204 ns 1000
BM_rotl/2M 4769689 ns 4830999 ns 155
BM_rotl/8M 19997537 ns 18720120 ns 35
BM_rotl_intrin/8 6 ns 6 ns 112178768
BM_rotl_intrin/64 55 ns 53 ns 14022346
BM_rotl_intrin/512 431 ns 407 ns 1725827
BM_rotl_intrin/4k 3327 ns 3338 ns 224358
BM_rotl_intrin/32k 27093 ns 26596 ns 26395
BM_rotl_intrin/256k 217633 ns 214167 ns 3205
BM_rotl_intrin/2M 1885492 ns 1853925 ns 345
BM_rotl_intrin/8M 8015337 ns 7626716 ns 90
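#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>  // declares _rotl as an ordinary function on Windows toolchains

// Both benchmarks share one body; the only difference is whether
// <intrin.h> has been included before the function is defined.
#define MAKE_ROTL_BENCHMARK(name) \
static void name(benchmark::State& state) { \
    auto arr = new uint32_t[state.range(0)](); /* zero-initialized */ \
    while (state.KeepRunning()) { \
        for (int i = 0; i < state.range(0); ++i) { \
            arr[i] = _rotl(arr[i], 16); \
        } \
    } \
    delete [] arr; \
} \
/**/
// Defined before <intrin.h> is included.
MAKE_ROTL_BENCHMARK(BM_rotl)
#include <intrin.h>
// Defined after <intrin.h> is included.
MAKE_ROTL_BENCHMARK(BM_rotl_intrin)
#undef MAKE_ROTL_BENCHMARK
BENCHMARK(BM_rotl)->Range(8, 8<<20);
BENCHMARK(BM_rotl_intrin)->Range(8, 8<<20);
BENCHMARK_MAIN();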
Upvotes: 2
Reputation: 181077
Just in case you're stuck with the intrinsic on Windows, here's a way to do it using inline assembler on x86:
#include <stdint.h>

/* "+r" makes x both input and output; "c" places the count in %cl. */
uint32_t rotl32_2(uint32_t x, uint8_t r) {
    asm("roll %1,%0" : "+r" (x) : "c" (r));
    return x;
}
Tested on Ubuntu's GCC, but it should work well on MinGW.
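Since the asm version only builds with GCC-style inline assembly on x86, it may be worth guarding it so that other targets still compile. A minimal sketch, assuming GCC's predefined __i386__ and __x86_64__ macros; rotl32_asm is a hypothetical name and the fallback branch is not part of the original snippet.

```c
#include <stdint.h>

/* rotl32_asm is a hypothetical name used for illustration only. */
static inline uint32_t rotl32_asm(uint32_t x, uint8_t r)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* Same instruction as above: the count goes in %cl, x is updated in place.
     * The "cc" clobber states explicitly that the flags are modified. */
    __asm__("roll %1,%0" : "+r" (x) : "c" (r) : "cc");
    return x;
#else
    /* Shift-based fallback for other compilers and architectures. */
    unsigned n = r & 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
#endif
}
```

One trade-off to keep in mind: the compiler treats an asm statement as opaque, so it cannot constant-fold the rotate or schedule around it the way it can with the shift form.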
Upvotes: 3