Reputation: 3256
I want to support the following operation in C++:
template <typename T>
void generate_random_simd(T* array, std::size_t count, T upper_bound, T lower_bound) {
    // fills array[0..count) with random values in [lower_bound, upper_bound] using SIMD instructions
}
The type T could be any unsigned int, int, or float type, either 32-bit or 64-bit. Is there an efficient implementation available directly, or some literature on this topic?
I did find a few implementations, such as this and this. But they don't support all the above types, nor do they support supplying upper/lower bounds. Using them might in turn require additional processing to achieve the result, whose overhead I'm afraid would end up being equivalent to simply looping and using a standard (non-SIMD) C++ random number generator.
Upvotes: 0
Views: 1172
Reputation: 364458
Element boundaries only matter when you have lower/upper bounds. Otherwise for integer you just want 128 or 256 bits of random data in a SIMD vector.
For example, you can use an SSE2 / AVX2 xorshift+ which runs multiple xorshift+ generators in 64-bit SIMD elements. You can treat this as 16x uint8_t, or 2x uint64_t, or anything in between, when you want to actually use the random data for something.
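For reference, here's a minimal SSE2 sketch of one xorshift128+ step vectorized across two 64-bit lanes (my own illustration; the linked answer runs several such generators and interleaves their output, which this doesn't show):

```cpp
#include <immintrin.h>

// One step of xorshift128+ in each 64-bit half of an SSE2 vector.
// state0/state1 must be seeded with independent non-zero values per lane.
static inline __m128i xorshift128plus_sse2(__m128i &state0, __m128i &state1)
{
    __m128i s1 = state0;
    const __m128i s0 = state1;
    state0 = s0;
    s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));
    state1 = _mm_xor_si128(_mm_xor_si128(s1, s0),
                           _mm_xor_si128(_mm_srli_epi64(s1, 18),
                                         _mm_srli_epi64(s0, 5)));
    return _mm_add_epi64(state1, s0);   // 128 bits of random data per call
}
```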
Here's an example of using that as 16-bit elements to produce multiple vectors of decimal digits, in my answer to What's the fastest way to generate a 1 GB text file containing random digits? over on unix.SE. (Written in C with Intel intrinsics, with Core 2, Haswell, and Skylake benchmark numbers.)
It runs fast enough that you'll want to consume the output while it's still hot in cache, e.g. cache-block in chunks of 4 or 8 kiB for L1d hits. Or simply use a vector of random numbers as they're produced.
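A sketch of that cache-blocked consumption pattern might look like the following (the buffer size and function names are just placeholders, reusing the xorshift128+ step sketched above):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

__m128i xorshift128plus_sse2(__m128i &state0, __m128i &state1);  // as sketched above

constexpr std::size_t kChunkBytes = 4096;   // roughly an L1d-friendly block

// Refill a small buffer with raw random bytes, then let the caller consume it
// while it's still hot in cache before generating the next chunk.
void fill_chunk(std::uint8_t *buf, __m128i &s0, __m128i &s1)
{
    for (std::size_t i = 0; i < kChunkBytes; i += sizeof(__m128i)) {
        __m128i r = xorshift128plus_sse2(s0, s1);
        _mm_storeu_si128(reinterpret_cast<__m128i *>(buf + i), r);
    }
}
```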
You can of course use a different divisor, and add something to each element to get a range other than 0..upper. But it's most efficient with a compile-time-constant range. Still, you could use libdivide for SIMD division (or modulo) by a runtime variable.
With unknown upper/lower bounds, you probably only want to use the input vector for one vector of results. When I was optimizing for max speed, it made sense to process multiple 0..9 digits out of a 16-bit integer, to save xorshift+ work. 0..9 is such a small fraction of 0..65535 that there's plenty of entropy left, and the later digits have a different bias than the first remainder.
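As one illustration of a division-free range reduction (my own sketch, not from the linked answer): for 16-bit elements you can map each uniform value into [lower, upper] with an unsigned multiply-high, at the cost of a small bias whenever the range doesn't evenly divide 2^16.

```cpp
#include <immintrin.h>
#include <cstdint>

// Map 8x uniform uint16_t lanes into [lower, upper] without any division.
// (rnd * range) >> 16 gives a value in [0, range); add lower to shift it.
// Slightly biased unless range divides 65536; assumes upper > lower and
// upper - lower + 1 fits in 16 bits.
static inline __m128i scale_to_range_u16(__m128i rnd, std::uint16_t lower, std::uint16_t upper)
{
    const std::uint16_t range = static_cast<std::uint16_t>(upper - lower + 1);
    __m128i scaled = _mm_mulhi_epu16(rnd, _mm_set1_epi16(static_cast<short>(range)));
    return _mm_add_epi16(scaled, _mm_set1_epi16(static_cast<short>(lower)));
}
```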
FP is harder than integer because some bit-patterns represent NaN. And you often want a uniform distribution along the real number line, not a uniform distribution of finite bit patterns. (Half of all representable float values have magnitude less than 1.0. The closer to zero you get, the closer together floats can be.)
Apparently it's common to generate uniform FP random numbers in a [0, 1.0) range. (That's 1/4 of the total representable values.) Scaling to a [0, N) range with a multiply works fine for N < 2^24, but for larger than that you start to lose entropy and introduce bias, according to Daniel Lemire's article, "How many floating-point numbers are in the interval [0,1]?".
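A minimal sketch of that [0, 1.0) conversion (my own example, keeping only 24 random bits so the convert and the power-of-two scale are exact):

```cpp
#include <immintrin.h>

// Convert 4x 32-bit random integers to floats uniformly spaced in [0.0, 1.0):
// keep the low 24 bits (exactly representable in a float), convert, scale by 2^-24.
static inline __m128 rand_float_01(__m128i rnd32)
{
    __m128i bits24 = _mm_and_si128(rnd32, _mm_set1_epi32(0x00FFFFFF));
    return _mm_mul_ps(_mm_cvtepi32_ps(bits24), _mm_set1_ps(1.0f / 16777216.0f));
}
```

Multiply by (upper - lower) and add lower to hit the question's [lower, upper) range, subject to the bias caveat above for large N.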
Depending on the size of your range, it seems to me it's a lot easier to generate them in the [1.0, 2.0) range (or any other single-exponent range), by combining a 23-bit random significand (mantissa) with a fixed exponent / sign-bit.
That's fewer bits of entropy, but is trivially uniform and can be done with a SIMD _mm_and_ps and _mm_or_ps. (Unfortunately for this, the significand is only 23 bits wide, not a multiple of 8 or 16, so we can't simply use _mm_blendv_epi8 or _mm_blend_epi16.)
If you want a distribution other than uniform (https://en.wikipedia.org/wiki/Random_number_generation#Generation_from_a_probability_distribution), e.g. Gaussian or Poisson, you'll have to find an algorithm for that.
Sampling with rejection doesn't work well for SIMD because of the required branching. You could maybe do 2 candidate vectors of random numbers and branchlessly combine them, then branch if any still need to be rejected.
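The branchless combine might look like this sketch (my own example for signed 32-bit lanes; lanes where both candidates miss the range still need the fallback branch):

```cpp
#include <immintrin.h>

// Keep candidate 'a' in lanes where it falls inside [lower, upper] (signed 32-bit),
// and take candidate 'b' in the other lanes.  To detect lanes where both candidates
// miss, build the same mask for 'b' and check (~a_ok & ~b_ok) with _mm_movemask_epi8.
static inline __m128i pick_in_range(__m128i a, __m128i b, __m128i lower, __m128i upper)
{
    __m128i too_low  = _mm_cmplt_epi32(a, lower);
    __m128i too_high = _mm_cmpgt_epi32(a, upper);
    __m128i a_ok = _mm_andnot_si128(_mm_or_si128(too_low, too_high), _mm_set1_epi32(-1));
    // SSE2 blend: (a & a_ok) | (b & ~a_ok)
    return _mm_or_si128(_mm_and_si128(a_ok, a), _mm_andnot_si128(a_ok, b));
}
```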
Maybe left-packing the non-rejected candidates would let you fairly efficiently fill a buffer with random numbers, producing a variable number each iteration. See AVX2 what is the most efficient way to pack left based on a mask? for SSE2 / AVX2 / AVX512 left-packing.
Again, keep the buffer chunk size small enough that you get L1d or at least L2 cache hits when you loop back over it.
Upvotes: 2