gilbertohasnofb

Reputation: 2054

Non-repeating PRNG algorithm

The following algorithm generates an array of random numbers in which no value is equal to the one immediately before it (the example is written in Fortran 95):

program test
implicit none

real :: x
integer :: i, aux
integer, dimension(100) :: y = 0

do i=2,100
  call RANDOM_NUMBER(x)
  aux = int(3 * x) + 1 ! random number: 1, 2 or 3
  aux = aux + y(i-1) ! adding previous selected number
  y(i) = MOD(aux,4) ! mod 4 gives the final result: 0, 1, 2 or 3
  print*, y(i)
enddo

end program test

On another discussion forum, a member proposed this algorithm as a solution to a challenge: how to output non-repeating numbers using a regular random number generator and a fixed number of operations per loop (so, for instance, drawing again whenever a random value is the same as the previous one would not qualify, since the number of operations per loop would no longer be constant).

His algorithm seems to work well: the results are uniformly distributed and there are no obvious patterns in any sub-strings of the output (I searched for sub-strings of sizes 2 to 5 and all behaved as expected). But what puzzles me in this solution is that the random number generator is outputting only three possible numbers (0, 1 or 2) and yet the whole algorithm outputs four possible results (0, 1, 2 or 3). How is this possible? I thought that the results of a PRNG could be mapped down but not up. For example, if a PRNG produces numbers between 0 and 7, they can be mapped as 0-3 => 0 and 4-7 => 1, but a PRNG producing only 0's and 1's cannot produce results between 0 and 7 within a single loop iteration (although one could obviously group three consecutive results in order to map 000 => 0, 001 => 1, ... 111 => 7).


Edit: this is the same algorithm written in pseudocode, as this question is not specific to Fortran or to any other programming language.

x ← 0
do
  aux ← random number between 1 and 3
  aux ← aux + x
  x ← aux MOD 4
  print x
enddo
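
For readers who want to run the pseudocode, here is a minimal Python sketch of it. The function name `non_repeating`, the `seed` parameter and the sample size of 10,000 are assumptions made for this illustration and are not part of the original algorithm.

```python
import random
from collections import Counter

def non_repeating(count, seed=None):
    """Yield `count` values in 0..3 such that no value equals its predecessor."""
    rng = random.Random(seed)
    x = 0  # starting value, as in the pseudocode
    for _ in range(count):
        step = rng.randint(1, 3)  # uniform over 1, 2, 3 (both ends inclusive)
        x = (x + step) % 4        # a non-zero step modulo 4 never lands on the old x
        yield x

values = list(non_repeating(10_000, seed=1))
repeats = sum(a == b for a, b in zip(values, values[1:]))
print("consecutive repeats:", repeats)  # 0, by construction
print("frequencies:", sorted(Counter(values).items()))
```

The loop body performs the same operations on every pass, which is the "fixed number of operations" property the challenge asked for.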

Upvotes: 1

Views: 452

Answers (2)

gilbertohasnofb

Reputation: 2054

At first sight, the algorithm above seems to take as input random integers ranging between 0 and 2 (i.e. 3 values) and output random integers ranging between 0 and 3 (i.e. 4 values) for each cycle, which seems to be problematic due to upsampling. But actually the algorithm is always choosing among 3 options only, given that each value cannot be the same as the previous one. For instance, if the previous integer is 0, there are three possible values for the next integer (1, 2 or 3), which is exactly the number of values the PRNG provides. So the key is to realize that 3 random values are being mapped into 4 non-repeating random values, and this can be done without causing any unwanted patterns.

Therefore, there is no problem using MOD N+1 for a random input that takes N values (0 to N-1), because the amount of information does not change. But when we use MOD N+2 or larger, we do observe patterns that should not be there if the output were truly random. For instance, certain pairs of consecutive numbers never appear: e.g. taking N = 3 (i.e. input between 0 and 2) and MOD 5, one will never see a 0 followed by a 4, since there is no input for which ((input + 1) + 0) MOD 5 = 4 holds.
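
The "three options per step" observation can be checked by listing, for every previous value, which values the next step can produce. This is only an illustrative sketch; the names `MODULUS`, `STEPS` and `successors` are made up for the example.

```python
MODULUS = 4
STEPS = (1, 2, 3)  # the three equally likely increments (aux in the question)

# For each previous value, collect the values the next iteration can produce.
successors = {
    prev: sorted((prev + s) % MODULUS for s in STEPS)
    for prev in range(MODULUS)
}

for prev, nxt in successors.items():
    print(prev, "->", nxt)
# 0 -> [1, 2, 3]
# 1 -> [0, 2, 3]
# 2 -> [0, 1, 3]
# 3 -> [0, 1, 2]
```

Every previous value has exactly three distinct successors and is never its own successor, so each draw from the 3-valued generator selects one of 3 outcomes, not one of 4.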
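
The missing pairs can be enumerated directly. Below is a small sketch, assuming the same three increments 1, 2, 3 as in the question; the helper name `missing_pairs` is hypothetical.

```python
def missing_pairs(modulus, steps=(1, 2, 3)):
    """Return the (previous, next) pairs with previous != next that no step can produce."""
    reachable = {(prev, (prev + s) % modulus) for prev in range(modulus) for s in steps}
    wanted = {(a, b) for a in range(modulus) for b in range(modulus) if a != b}
    return sorted(wanted - reachable)

print(missing_pairs(4))  # []
print(missing_pairs(5))  # [(0, 4), (1, 0), (2, 1), (3, 2), (4, 3)]
```

With MOD 4 every non-repeating pair is reachable, while with MOD 5 each value has one successor that can never follow it, including the "0 followed by 4" case.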

Upvotes: 0

Kent Weigel

Reputation: 1178

Well, I may be missing something, because I don't remember Fortran perfectly.

Why are you allowed to access y(i-1) when i = 1? Isn't that an array boundary violation? I will assume that it just returns zero or something.

The first time through the loop, aux will end up being 1, 2 or 3, assuming y(i-1) = 0, and y(1) will be the same (1, 2 or 3). Then the second time through, aux will be (1, 2 or 3) + (1, 2 or 3) which will be 2, 3, 4, 5 or 6 and y(2) will be 0, 1, 2 or 3, since 4 MOD 4 = 0 and 5 MOD 4 = 1. From there on out, y(n) can be 0, 1, 2, or 3, since you will always be adding (1, 2 or 3) to (0, 1, 2 or 3) and modding by 4.

I have the feeling you are making an assumption that I am not making, and I don't see what valid assumption would restrict the output to only 3 values.

The RANDOM_NUMBER function assigns 0 <= x < 1. It's not clear to me why you are thinking about binary representation of the numbers, since you don't seem to be using bitwise operators.

Edit: That makes more sense. I didn't understand that your main focus was on the distribution of the numbers, statistically speaking. I could probably explain my thoughts on the subject better with a number series, or some statistical notation, if I remembered much about either from academia, but like my wrestler roommate in college who did anything to avoid algebra would have done, I will just map out my thoughts with a bunch of empirical numbers:

The first time through the loop, you would end up with the following possible values: (trivial case)

aux: could be 1, 2 or 3

y(i-1): 0

aux + y(i-1): 1, 2 or 3

y(i): 1, 2 or 3

Now the second time through, you actually have to deal with the weightings caused by the probability of getting each possible result:

aux: 1, 2 or 3

y(i-1): 1, 2 or 3

aux + y(i-1): {the sum of each combination of aux and y(i-1)} 2 3 3 4 4 4 5 5 6 {meaning there is a 1/9 chance of getting a 2, same for 6, a 2/9 chance of getting a 3, same for 5, and a 3/9 chance of getting a 4; this is not a very even distribution}

y(i): 0 0 0 1 1 2 2 3 3 {same rationale, a little more even distribution for this "output"}

The third time through:

aux: 1, 2 or 3

y(i-1): 0 0 0 1 1 2 2 3 3

aux + y(i-1): {after sorting} 1 1 1 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 6 6

y(i): {after sorting} 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3

Balancing a) the risk of assuming that this equalizing trend will continue for the distribution of values with b) the desire to avoid the growing complexity of calculating these sample values, it seems reasonable to extrapolate that the distribution will be somewhat even for growing i.
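
That extrapolation can be backed by a short recurrence. This is a sketch under the assumption that aux is uniform over 1, 2, 3 and independent of earlier draws; the symbol p_n(k), the probability that the n-th output equals k, is notation introduced here and does not appear in the original posts. Since a value k can be reached from each of the three other values with probability 1/3:

```latex
\begin{align*}
p_{n+1}(k) &= \frac{1}{3}\sum_{j \neq k} p_n(j) = \frac{1 - p_n(k)}{3}, \\
p_{n+1}(k) - \frac{1}{4} &= -\frac{1}{3}\left(p_n(k) - \frac{1}{4}\right).
\end{align*}
```

Under these assumptions, the deviation of each value from 1/4 shrinks by a factor of 3 per iteration and changes sign each time. For the value 0 this gives 0, 1/3, 2/9, and so on, alternating around 1/4.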

The best way to prove it to yourself for sure is to alter the algorithm to keep arrays of possible results for each iteration, and output probabilities, rather than random numbers. I will leave that as an exercise for you.
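
One possible shape for that exercise, as a hedged sketch in Python: it propagates exact probabilities with the standard `fractions` module instead of drawing random numbers. The function name `next_distribution` and the choice of six iterations are assumptions made for the illustration.

```python
from fractions import Fraction

def next_distribution(dist, steps=(1, 2, 3)):
    """Push an exact distribution over 0..len(dist)-1 through one loop iteration."""
    modulus = len(dist)
    out = [Fraction(0)] * modulus
    for prev, p in enumerate(dist):
        for s in steps:
            # Each step is equally likely, so it carries p / len(steps) of the mass.
            out[(prev + s) % modulus] += p / len(steps)
    return out

dist = [Fraction(1), Fraction(0), Fraction(0), Fraction(0)]  # start at 0 with certainty
for iteration in range(1, 7):
    dist = next_distribution(dist)
    print(iteration, [str(p) for p in dist])
# 1 ['0', '1/3', '1/3', '1/3']
# 2 ['1/3', '2/9', '2/9', '2/9']
# 3 ['2/9', '7/27', '7/27', '7/27']
# ...
```

The first three printed rows match the hand-enumerated cases, and later rows move steadily closer to 1/4 for every value.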

I suspect that the error caused by not having the range of possible values of aux + y(i-1) be a multiple of the right-hand operand of MOD is offset by the hashing nature of MOD. What I mean is, I think that the lopsidedness of probability is distributed within the range of possible values in a sliding, or more likely a rotating, window, throughout the range of values of y(i), from one iteration to the next. Hopefully you understand what I mean by that.

Upvotes: 1
