Reputation: 3256
I want to support the following operation in C++:
template <typename T>
void generate_random_simd(T* array, std::size_t count, T upper_bound, T lower_bound) {
    // fills array[0..count) with random values in [lower_bound, upper_bound] using SIMD instructions
}
The type T could be any unsigned int, int, or float type, either 32-bit or 64-bit. Is there an efficient implementation available directly, or some literature on this topic?
I did find a few implementations, such as this and this. But they don't support all the above types, nor do they support supplying upper/lower bounds. Using them might in turn require additional processing to achieve the result, whose overhead I'm afraid would end up being equivalent to simply looping and using a standard (non-SIMD) C++ random number generator.
Upvotes: 0
Views: 1172
Reputation: 364458
Element boundaries only matter when you have lower/upper bounds. Otherwise for integer you just want 128 or 256 bits of random data in a SIMD vector.
For example, you can use an SSE2 / AVX2 xorshift+ which runs multiple xorshift+ generators in 64-bit SIMD elements. You can treat this as 16x uint8_t, or 2x uint64_t, or anything in between, when you want to actually use the random data for something.
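For reference, here's a minimal SSE2 sketch of one xorshift128+ step vectorized across two 64-bit lanes (my own illustration; the linked answer runs several such generators and interleaves their output, which this doesn't show):

```cpp
#include <immintrin.h>

// One step of xorshift128+ in each 64-bit half of an SSE2 vector.
// state0/state1 must be seeded with independent non-zero values per lane.
static inline __m128i xorshift128plus_sse2(__m128i &state0, __m128i &state1)
{
    __m128i s1 = state0;
    const __m128i s0 = state1;
    state0 = s0;
    s1 = _mm_xor_si128(s1, _mm_slli_epi64(s1, 23));
    state1 = _mm_xor_si128(_mm_xor_si128(s1, s0),
                           _mm_xor_si128(_mm_srli_epi64(s1, 18),
                                         _mm_srli_epi64(s0, 5)));
    return _mm_add_epi64(state1, s0);   // 128 bits of random data per call
}
```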
Here's an example of using that as 16-bit elements to produce multiple vectors of decimal digits, in my answer to What's the fastest way to generate a 1 GB text file containing random digits? over on unix.SE. (Written in C with Intel intrinsics, with Core 2, Haswell, and Skylake benchmark numbers.)
It runs fast enough that you'll want to consume the output while it's still hot in cache, e.g. cache-block in chunks of 4 or 8 kiB for L1d hits. Or simply use a vector of random numbers as they're produced.
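A sketch of that cache-blocked consumption pattern might look like the following (the buffer size and function names are just placeholders, reusing the xorshift128+ step sketched above):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

__m128i xorshift128plus_sse2(__m128i &state0, __m128i &state1);  // as sketched above

constexpr std::size_t kChunkBytes = 4096;   // roughly an L1d-friendly block

// Refill a small buffer with raw random bytes, then let the caller consume it
// while it's still hot in cache before generating the next chunk.
void fill_chunk(std::uint8_t *buf, __m128i &s0, __m128i &s1)
{
    for (std::size_t i = 0; i < kChunkBytes; i += sizeof(__m128i)) {
        __m128i r = xorshift128plus_sse2(s0, s1);
        _mm_storeu_si128(reinterpret_cast<__m128i *>(buf + i), r);
    }
}
```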
You can of course use a different divisor, and add something to each element to get a range other than 0..upper. But it's most efficient with a compile-time-constant range. Still, you could use libdivide for SIMD division (or modulo) by a runtime variable.
With unknown upper/lower bounds, you probably only want to use the input vector for one vector of results. When I was optimizing for max speed, it made sense to process multiple 0..9 digits out of a 16-bit integer, to save xorshift+ work. 0..9 is such a small fraction of 0..65535 that there's plenty of entropy left, and the later digits have a different bias than the first remainder.
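As one illustration of a division-free range reduction (my own sketch, not from the linked answer): for 16-bit elements you can map each uniform value into [lower, upper] with an unsigned multiply-high, at the cost of a small bias whenever the range doesn't evenly divide 2^16.

```cpp
#include <immintrin.h>
#include <cstdint>

// Map 8x uniform uint16_t lanes into [lower, upper] without any division.
// (rnd * range) >> 16 gives a value in [0, range); add lower to shift it.
// Slightly biased unless range divides 65536; assumes upper > lower and
// upper - lower + 1 fits in 16 bits.
static inline __m128i scale_to_range_u16(__m128i rnd, std::uint16_t lower, std::uint16_t upper)
{
    const std::uint16_t range = static_cast<std::uint16_t>(upper - lower + 1);
    __m128i scaled = _mm_mulhi_epu16(rnd, _mm_set1_epi16(static_cast<short>(range)));
    return _mm_add_epi16(scaled, _mm_set1_epi16(static_cast<short>(lower)));
}
```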
FP is harder than integer because some bit-patterns represent NaN. And you often want a uniform distribution along the real number line, not a uniform distribution of finite bit patterns. (Half of all representable float values have magnitude less than 1.0. The closer to zero you get, the closer together floats can be.)
Apparently it's common to generate uniform FP random numbers in a [0, 1.0) range. (That's 1/4 of the total representable values.) Scaling to a [0, N) range with a multiply works fine for N < 2^24, but for larger than that you start to lose entropy and introduce bias, according to Daniel Lemire's article, "How many floating-point numbers are in the interval [0,1]?".
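A minimal sketch of that [0, 1.0) conversion (my own example, keeping only 24 random bits so the convert and the power-of-two scale are exact):

```cpp
#include <immintrin.h>

// Convert 4x 32-bit random integers to floats uniformly spaced in [0.0, 1.0):
// keep the low 24 bits (exactly representable in a float), convert, scale by 2^-24.
static inline __m128 rand_float_01(__m128i rnd32)
{
    __m128i bits24 = _mm_and_si128(rnd32, _mm_set1_epi32(0x00FFFFFF));
    return _mm_mul_ps(_mm_cvtepi32_ps(bits24), _mm_set1_ps(1.0f / 16777216.0f));
}
```

Multiply by (upper - lower) and add lower to hit the question's [lower, upper) range, subject to the bias caveat above for large N.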
Depending on the size of your range, it seems to me it's a lot easier to generate them in the [1.0, 2.0) range (or any other single-exponent range), by combining a 23-bit random significand (mantissa) with a fixed exponent / sign-bit.
That's fewer bits of entropy, but is trivially uniform and can be done with a SIMD _mm_and_ps and _mm_or_ps. (Unfortunately for this, the significand is only 23 bits wide, not a multiple of 8 or 16, so we can't simply use _mm_blendv_epi8 or _mm_blend_epi16.)
If you want a distribution other than uniform (https://en.wikipedia.org/wiki/Random_number_generation#Generation_from_a_probability_distribution), e.g. Gaussian or Poisson, you'll have to find an algorithm for that.
Sampling with rejection doesn't work well for SIMD because of the required branching. You could maybe do 2 candidate vectors of random numbers and branchlessly combine them, then branch if any still need to be rejected.
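The branchless combine might look like this sketch (my own example for signed 32-bit lanes; lanes where both candidates miss the range still need the fallback branch):

```cpp
#include <immintrin.h>

// Keep candidate 'a' in lanes where it falls inside [lower, upper] (signed 32-bit),
// and take candidate 'b' in the other lanes.  To detect lanes where both candidates
// miss, build the same mask for 'b' and check (~a_ok & ~b_ok) with _mm_movemask_epi8.
static inline __m128i pick_in_range(__m128i a, __m128i b, __m128i lower, __m128i upper)
{
    __m128i too_low  = _mm_cmplt_epi32(a, lower);
    __m128i too_high = _mm_cmpgt_epi32(a, upper);
    __m128i a_ok = _mm_andnot_si128(_mm_or_si128(too_low, too_high), _mm_set1_epi32(-1));
    // SSE2 blend: (a & a_ok) | (b & ~a_ok)
    return _mm_or_si128(_mm_and_si128(a_ok, a), _mm_andnot_si128(a_ok, b));
}
```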
Maybe left-packing the non-rejected candidates would let you fairly efficiently fill a buffer with random numbers, producing a variable number each iteration. See AVX2 what is the most efficient way to pack left based on a mask? for SSE2 / AVX2 / AVX512 left-packing.
Again, keep the buffer chunk size small enough that you get L1d or at least L2 cache hits when you loop back over it.
Upvotes: 2