Reputation: 45
Here is a very simple script generating a 2D gaussian with 10000 points. The covariance matrix estimated by np.cov seems really far from the generating one. What is the explanation and are there solutions ?
import numpy as np
import matplotlib.pyplot as plt
center=[0,0]
npoints=10000
data_covmat = np.array([[1,1],[1,0.5]])
lines=np.random.multivariate_normal(center,data_covmat,npoints)
print(f'2D gaussian centered at {center}, {npoints} points\nCovariance matrix =')
print(data_covmat)
plt.scatter(lines[:,0],lines[:,1],alpha=.1)
plt.axis('scaled')
plt.show()
print(f'Sample covariance matrix =\n{np.cov(lines,rowvar=False)}')
Covariance matrix =
[[1. 1. ] [1. 0.5]]
Sample covariance matrix =
[[1.23880367 0.74585136] [0.74585136 0.85974812]]
Upvotes: 0
Views: 176
Reputation: 114811
The array [[1, 1], [1, 0.5]] is not positive semidefinite. One of its eigenvalues is negative. The description of the cov
argument in the docstring of multivariate_normal
says "Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling."
Try it with, say, [[1, 0.6], [0.6, 0.5]], which is symmetric and positive definite, and it works as expected:
In [37]: npoints = 10000
In [38]: center = [0, 0]
In [39]: data_covmat = np.array([[1, 0.6], [0.6, 0.5]])
In [40]: np.linalg.eigvals(data_covmat)
Out[40]: array([1.4, 0.1])
In [41]: lines = np.random.multivariate_normal(center, data_covmat, npoints)
In [42]: np.cov(lines, rowvar=False)
Out[42]:
array([[0.99782727, 0.60349542],
[0.60349542, 0.50179535]])
Upvotes: 1