user5497885

Reputation:

numpy random usage validity

EDIT 2: Precise question reformulated. EDIT: Typo in the code fixed:

I would like to generate 1000 samples of 10 INDEPENDENT random variables (here Gaussian).

Are these two approaches equivalent from a mathematical point of view (independence between the 10 random variables)?

Direct: 1000 samples of 10 random variables.

import numpy as np
np.random.seed(123)
x = np.random.normal(0, 1, (10, 1000))

With a loop: generate the vector of 10 random variables sample by sample.

 import numpy as np
 np.random.seed(123)
 x = np.zeros((10, 1000))
 for i in range(1000):
     x[:, i] = np.random.normal(0, 1, 10)

The reason is to do parallel sampling... Do these two methods generate 1000 samples of 10 independent Gaussian random variables? (I am not sure that the samples are independent.)

I looked into other questions; there is no indication of how the Mersenne Twister is implemented in numpy. So the question is whether two successive calls of random.normal produce independent values (i.e. it depends on the implementation of random.normal).

(Imagine the case where I "manually" reset the seed between the calls... the results are obviously not independent... though there is no reason to do that.)

Of course, we can check A POSTERIORI that it might be true by looking at the statistics... but that does not prove it...
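As a sketch of such an a posteriori check (my own illustration, not proof of independence): for independent draws, the sample correlation matrix of the 10 variables should have off-diagonal entries close to zero.

```python
import numpy as np

np.random.seed(123)
x = np.random.normal(0, 1, (10, 1000))  # 10 variables, 1000 samples each

# Sample correlation matrix across the 1000 samples.
corr = np.corrcoef(x)

# For independent variables, off-diagonal correlations should be small
# (on the order of 1/sqrt(1000) ~ 0.03), but small correlation is only
# evidence consistent with independence, not a proof of it.
off_diag = corr[~np.eye(10, dtype=bool)]
print(np.max(np.abs(off_diag)))
```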

Upvotes: 2

Views: 319

Answers (1)

Reti43

Reputation: 9796

Yes, they are statistically equivalent. As long as you generate numbers for the same normal distribution, they will have the same statistical characteristics. That is, the mean and standard deviation should be very close to those of the distribution they were drawn from. For example,

import numpy as np

mu, sigma = 2.0, 0.5

# A and B will have different seeds and so different numbers
A = np.random.normal(mu, sigma, size=(1000, 1000))
B = np.random.normal(mu, sigma, size=(1000, 1000))

# But their statistical characteristics should be similar enough
print(mu, np.mean(A), np.mean(B))
print(sigma, np.std(A), np.std(B))

The random numbers generated starting from the same seed will always be the same, assuming of course we use the same function, which we do in this case. So, what this means is that your two arrays will contain the same elements, but not in the same order. We can observe this with a small enough example to fit our screen.

size = (3, 5)

def method1():
    np.random.seed(123)
    return np.random.normal(0, 1, size)

def method2():
    np.random.seed(123)
    x = np.zeros(size, dtype=float)
    for i in range(size[1]):
        x[:,i] = np.random.normal(0, 1, size[0])
    return x

Output

# method 1
array([[-1.0856306 ,  0.99734545,  0.2829785 , -1.50629471, -0.57860025],
       [ 1.65143654, -2.42667924, -0.42891263,  1.26593626, -0.8667404 ],
       [-0.67888615, -0.09470897,  1.49138963, -0.638902  , -0.44398196]])

# method 2
array([[-1.0856306 , -1.50629471, -2.42667924, -0.8667404 ,  1.49138963],
       [ 0.99734545, -0.57860025, -0.42891263, -0.67888615, -0.638902  ],
       [ 0.2829785 ,  1.65143654,  1.26593626, -0.09470897, -0.44398196]])

Both methods generate the same 3×5 = 15 numbers. However, the first method puts the first 5 in the first row, the next 5 in the second row, etc., while the second method puts the first 3 in the first column, the next 3 in the second column, etc. In fact, if method2() were rewritten as follows, it would place the numbers the same way (row by row) as method1().

# same result as `method1()`
def method3():
    np.random.seed(123)
    x = np.zeros(size, dtype=float)
    for i in range(size[0]):
        x[i,:] = np.random.normal(0, 1, size[1])
    return x

It goes without saying that if each method generates the same numbers but in a different order, the statistical characteristics of corresponding rows/columns between the two will not be the same (since they contain different numbers). However, if the sample in each row/column is large enough, i.e., for a large enough size, each should obey the statistical characteristics of the distribution it was drawn from.
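A quick sketch of this point at a larger size (my own example, with a hypothetical bound of 0.25 on the row means): the two fill orders produce different rows, yet every row's mean is close to the distribution mean of 0.

```python
import numpy as np

size = (1000, 1000)

np.random.seed(123)
A = np.random.normal(0, 1, size)      # filled row by row

np.random.seed(123)
B = np.zeros(size)
for i in range(size[1]):              # filled column by column
    B[:, i] = np.random.normal(0, 1, size[0])

# Corresponding rows hold different numbers...
print(np.allclose(A[0], B[0]))

# ...but with 1000 samples per row, every row mean is near 0
# (standard error ~ 1/sqrt(1000) ~ 0.03).
print(np.abs(A.mean(axis=1)).max())
print(np.abs(B.mean(axis=1)).max())
```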

Edit: Sampling independency

From the documentation we see that numpy implements the Mersenne Twister algorithm (MT). It is exposed as a container class, RandomState, which all the distribution methods, such as normal(), exponential(), etc., draw from.
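To make the container relationship concrete: the module-level functions like np.random.normal() draw from a single global RandomState instance, so an explicit RandomState with the same seed produces the same stream.

```python
import numpy as np

# An explicit MT19937-backed container...
rs = np.random.RandomState(123)
a = rs.normal(0, 1, 5)

# ...and the global one that np.random.seed()/np.random.normal() use.
np.random.seed(123)
b = np.random.normal(0, 1, 5)

# Same seed, same algorithm, same stream of numbers.
print(np.array_equal(a, b))  # True
```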

The MT is a widely used PRNG and has been studied extensively. Any PRNG worth its salt will perform well against a battery of tests of its statistical properties. Read about the Diehard tests and TestU01. Search for the keywords normal and uniform (i.e. independent) on those pages to read about specific tests.

You can also read this post about the theoretical approach to a truly random number generator.

Bottom line, the generators we use, we use them because they have been deemed good enough for what we want to do. If you're worried that MT will not be up to the task for your problem, you may want to choose (and probably implement) something different. However, without knowing what you're trying to do, it's impossible to give any more solid advice.
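As an aside (this API is newer than the question, assuming NumPy >= 1.17): you don't have to implement an alternative yourself, since numpy now ships other bit generators behind a common Generator interface, with PCG64 as the default.

```python
import numpy as np

# PCG64 is the default bit generator for the modern Generator API.
rng_pcg = np.random.default_rng(123)

# The Mersenne Twister is still available as an explicit bit generator.
rng_mt = np.random.Generator(np.random.MT19937(123))

# Both expose the same distribution methods as the legacy interface.
print(rng_pcg.normal(0, 1, 5))
print(rng_mt.normal(0, 1, 5))
```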

Upvotes: 3
