David Andrews

Reputation: 89

PRNG for ARM Assembly?

I'm having trouble implementing a PRNG in ARM assembly. I've tried a few algorithms that work, but they end up taking a long time after the first few iterations, probably because the division (modulo) step is slow on large numbers. I'm trying to get a random number between 0 and 31. My rough work is below, with letters standing in for specific registers.

start:

mov t, x            // t = x

// t ^= t << 11
lsl temp, t, #11
eor t, temp

// t ^= t >> 8
lsr temp, t, #8
eor t, temp

// z = w
mov z, w

// x = y
mov x, y

// y = z
mov y, z

// w ^= w >> 19
lsr temp, w, #19
eor w, temp

// w ^= t
eor w, t

// result is the RETURNED RANDOM NUMBER
mov result, w

That's my attempt at the algorithm from the xorshift page on Wikipedia. I just need it to return a random number from 0 to 31, so implementing division on a 10-digit number takes a while and seems like overkill. If anyone can help me optimize it or point out a mistake, I'd appreciate it.

edit: The above subroutine returns the random number, and then I basically divide it by 31 (that code isn't given here) and take the remainder as my "random" number from 0 to 31.

Upvotes: 1

Views: 547

Answers (1)

Peter Cordes

Reputation: 364408

ARM instructions can shift or even rotate one of their inputs on the fly, so using separate shift instructions is a waste. (In Thumb mode, only the 32-bit Thumb-2 encodings can use the barrel shifter.)
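For example, each of the question's shift-then-eor pairs folds into a single instruction (a sketch in ARM mode, keeping the question's placeholder register names):

```
eor  t, t, t, lsl #11    @ t ^= t << 11
eor  t, t, t, lsr #8     @ t ^= t >> 8
eor  w, w, w, lsr #19    @ w ^= w >> 19
eor  w, w, t             @ w ^= t
```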

Note that if your routine is a function that you actually call, rather than just an inline snippet from a loop, then it doesn't follow the standard ABI. That's fine if the only caller is also written by you in asm. If you can dedicate 4 registers to PRNG state in your loop, then you don't need to pass pointers around or load/store.

As always, compiler output is often a good starting point:

// we need a loop to see how many mov instructions are actually needed when keeping state in regs
// Otherwise we just get loads/stores
uint32_t xorshift_loop(uint32_t *output, uint32_t x, uint32_t y, uint32_t z, uint32_t w) {
  for(int i=0 ; i<1000 ; ++i) {
    uint32_t t = x;
    t ^= t << 11;
    t ^= t >> 8;
    x = y; y = z; z = w;
    w ^= w >> 19;
    w ^= t;
    *(++output) = w;
  }
  return w;
}

The inner loop is:

## 32bit ARM gcc 4.8  -O3 -fverbose-asm
## The @comments are from -fverbose-asm, which is more helpful than usual here
.L4:
        eor     r6, r1, r1, lsl #11       @, t, x, x,
        eor     r5, r4, r4, lsr #19       @, w, w, w,
        eors    r5, r5, r6                @, t, w, t
        mov     r1, r2                    @ x, y
        eor     r5, r5, r6, lsr #8        @, w, t, t,
        str     r5, [r0, #4]!     @ w, MEM[base: _42, offset: 4B]  // pre-indexed store with writeback, matching *(++output) = w
        cmp     r0, r7    @ ivtmp.20, D.4237
        mov     r2, r3    @ y, z
        mov     r3, r4    @ z, w
        mov     r4, r5    @ w, w
        bne     .L4       @,

Notice how the XORs are reordered so the first couple instructions are part of separate dep chains. That will help for superscalar in-order ARM cores, or if eor with a shifted operand has latency greater than 1. It also chooses to do w^=t; w^= t>>8 instead of t^= t>>8; w^=t;, but IDK if there's any particular advantage to that.

An unroll by 4 (the period of the x=y; y=z; z=w register rotation) would get rid of all the mov instructions, leaving each iteration doing just the four eor instructions with shifted inputs per result. gcc with -funroll-loops unrolls by 8, which makes the code hard to follow.
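One step of such an unrolled body could look like this (hypothetical register assignment: r1=x, r2=y, r3=z, r4=w, r5 as scratch). Each new w lands in the register holding the oldest, now-dead state word, so the *roles* of the registers rotate instead of their contents:

```
eor  r5, r1, r1, lsl #11   @ t = x ^ (x << 11)
eor  r5, r5, r5, lsr #8    @ t ^= t >> 8
eor  r1, r4, r4, lsr #19   @ w ^ (w >> 19) ...
eor  r1, r1, r5            @ ... ^ t  => new w replaces the dead x in r1
@ the next step reads x from r2, y from r3, z from r4, w from r1, and writes r2
```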


xorshift+ is apparently pretty good quality for something that's so fast.

It compiles to short code on AArch64, and to still-short, efficient-looking code on 32-bit ARM. Much more code than plain xorshift, though. See the Godbolt compiler explorer link above.
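For reference, the 64-bit xorshift128+ generator as given on the Wikipedia xorshift page looks like this in C (shift constants 23/17/26; a sketch for comparison, not necessarily the exact code behind that link):

```c
#include <stdint.h>

/* xorshift128+ state; must not be seeded to all-zero */
static uint64_t s[2] = {1, 2};

uint64_t xorshift128plus(void) {
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23;                          /* xorshift step a */
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26);  /* xorshift steps b, c */
    return s[1] + y;                       /* the "+" output function */
}
```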

Upvotes: 1
