Why samples generated by np.random.multivariate_normal method are not compatible with covariance matrix?

Question

I am working on data-driven robust optimization approach. In the numerical results part, I need to validate the method using sample data. I use np.random.multivariate_normal to generate the data for which I give the covariance matrix and mean vector as follows:

mean = [-1000, 1,1,1]
cov = [[200,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]

After generating the data, the first component of the samples are not distributed in the interval [-1200,-800], and instead, they belong to a smaller interval (e.g., [-1003,-997]). I need to generate sample data whose first components are more extensively distributed through the interval [-1200,-800]. The code that I use is as follows:

import numpy as np
import matplotlib.pyplot as plt
from numpy.random import multivariate_normal
# First 2D gaussian:
n=50 #number of samples
mean = [-1000, 1,1,1]
cov = [[200,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
Samples = np.random.multivariate_normal(mean, cov, n).T

I do not know where I am going wrong.

Maximilian Press · Accepted Answer

It seems that you are asking for a mean of -1000 and a variance of 1 (or rather, the original question was doing so; it appears to have been updated to 200, which however does not match the observations of -1003:-997 stated later).

Note that we expect 99.9% of the values in an infinite population to be within 3 sigma (standard deviation, $\sqrt{Var}$), which for you is essentially 3. So your results are expected (see image).

If you want a larger variance you will need to specify that in cov.

Guess and check to get the right variance

Here is some messing around I did to show this gradually increasing the variance:

>>> import numpy as np
# increase n to get more asymptotic
>>> n = 1000
# what you had
>>> mean = [-1000, 1,1,1]
>>> cov = [[1,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
>>> Samples = np.random.multivariate_normal(mean, cov, n).T
>>> min(Samples[0,:])
-1003.1521026984535
# larger variance ([0, 0] element)
>>> cov = [[200,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
>>> Samples = np.random.multivariate_normal(mean, cov, n).T
>>> min(Samples[0,:])
-1058.8437937762053
# yet larger variance
>>> cov = [[2000,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
>>> Samples = np.random.multivariate_normal(mean, cov, n).T
>>> min(Samples[0,:])
-1145.3564799931166
# yet larger variance
>>> cov = [[4000,0,0,0],[0,0.001,0,0],[0,0,0.001,0],[0,0,0,0.001]]
>>> Samples = np.random.multivariate_normal(mean, cov, n).T
>>> min(Samples[0,:])
-1247.6489017302786

Another approach to find the desired variance

You can additionally figure this out analytically. Let's say that you want -1200 to -800 to be your range, and your mean is -1000. Thus, you want 3 sigma to be 200, so you want sigma to be ~66.7.

Variance is sigma squared, so $66.7^2 = 4448.89$. This turns out to be quite close to the answer I arrived at by guess and check, a variance of 4000.

Thus, based on the desired range of your data, you can choose variances in a disciplined fashion.

Why samples generated by np.random.multivariate_normal method are not compatible with covariance matrix?

Answers (1)

Guess and check to get the right variance

Another approach to find the desired variance

Related Questions