Reputation: 1
I tried to model random sensor readings using numpy.random.normal (mean=0, std=1) and to estimate the standard deviation from these samples. Theoretical statistics (and also R) tell me that the estimate must converge towards the chosen std (which is 1). But somehow, using numpy (and scipy.stats), it does not.
This code generates a figure that shows this strange behavior:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, tstd

# system setup
m = 100    # number of measurements per sample
sigma = 1  # sensor std
ez = np.arange(1, 6, .05)
sample_sizes = [int(10**e) for e in ez]

# testing normal and std - they seem to work fine
sig_est = []
for n in sample_sizes:
    sample = np.random.normal(0, sigma, n*m)
    sig_est += [np.std(sample)]
plt.plot(ez, sig_est, marker='.', color='b', ls='', label='numpy - no means')

# numpy implementation of problem
sig_est = []
for n in sample_sizes:
    sample = np.random.normal(0, sigma, (n, m))
    sigma_est = np.std(sample, axis=1)
    sig_est += [np.mean(sigma_est)]
plt.plot(ez, sig_est, marker='.', color='k', ls='', label='numpy')

# scipy.stats implementation
sig_est = []
for n in sample_sizes:
    sample = norm.rvs(loc=0, scale=sigma, size=(n, m))
    sigma_est = tstd(sample, axis=1)
    sig_est += [np.mean(sigma_est)]
plt.plot(ez, sig_est, marker='.', color='r', ls='', label='scipy.stats')

plt.gca().set(xlabel='Number of samples [log10]')
plt.gca().legend()
plt.gca().grid(color='.9')
plt.show()
Any ideas?
Upvotes: 0
Views: 561
Reputation: 1879
That's an interesting one, because it's not a random number generator problem but a mathematics one :-) The short answer is that everything is working as expected.
The point is, in the first example you are taking a larger and larger sample of i.i.d. Gaussians and computing their standard deviation using np.std. This converges to 1, as your plot shows.
In the second plot, you are computing the standard deviation always over 100 elements, and then averaging these. In this way you are not computing the limit std over many elements, but rather the mean of the estimator of the standard deviation. Which, as you found out, is biased! This is for two reasons:

1. By default, np.std divides by n rather than n-1, which is biased even as an estimator of the variance. You need to pass ddof=1 to np.std, see documentation here: https://numpy.org/doc/stable/reference/generated/numpy.std.html.
2. Even with ddof=1, the estimator is unbiased only for the variance, not for the standard deviation (the square root of an unbiased estimator is not itself unbiased). So you should square the output of np.std before taking the mean.

You can see that if you replace your line

sig_est += [np.mean(sigma_est)]  # equivalent to sig_est.append(np.mean(sigma_est))

by

sig_est.append(np.mean(np.std(sample, axis=1, ddof=1)**2))

in the second block of your code, you'll indeed get convergence to 1.
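To illustrate the two effects without the plotting machinery, here is a minimal sketch (the row count of 100,000 is arbitrary, just large enough to make the averages stable): averaging many per-row standard deviations systematically underestimates sigma, while averaging the squared ddof=1 values is unbiased for the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=(100_000, 100))  # 100,000 samples of 100 points

# Averaging per-row stds (default ddof=0): systematically below sigma = 1
biased = np.mean(np.std(sample, axis=1))

# Averaging per-row *variances* with ddof=1: unbiased for sigma**2 = 1
unbiased_var = np.mean(np.std(sample, axis=1, ddof=1) ** 2)

print(biased)        # slightly but systematically below 1
print(unbiased_var)  # close to 1
```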
As to the last implementation using scipy, it seems to use yet another normalization: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.tstd.html
They call it "unbiased", but it's clearly not: on the one hand your plots show it, and on the other hand the exact factor needed to get an unbiased estimator (for Gaussians) is much more complicated than n/(n-1), see here: https://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation
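For reference, that Wikipedia article gives the exact Gaussian correction: E[s] = c4(n) * sigma for the ddof=1 standard deviation s, with c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2). A quick sketch (the function name c4 is mine, and I use scipy.special.gammaln to evaluate the Gamma ratio in log space for numerical stability):

```python
import numpy as np
from scipy.special import gammaln

def c4(n):
    # c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), computed in log space
    return np.exp(0.5 * (np.log(2.0) - np.log(n - 1.0))
                  + gammaln(n / 2.0) - gammaln((n - 1.0) / 2.0))

# Even the ddof=1 std is biased low; dividing by c4(n) removes the bias
rng = np.random.default_rng(1)
s = np.std(rng.normal(size=(200_000, 10)), axis=1, ddof=1)
print(s.mean())           # about c4(10), i.e. noticeably below 1
print(s.mean() / c4(10))  # close to 1
```

For n = 100, as in your code, c4(100) is about 0.9975, so with ddof=1 the residual bias of the std itself would only be about a quarter of a percent.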
Upvotes: 3