How does the covariance matrix affect the output when generating correlated normally distributed random samples?

Question

The scipy documentation contains an example of creating correlated random samples. The full code is at the end of the question.

The covariance matrix:

# The desired covariance matrix.
r = np.array([
        [  3.40, -2.75, -2.00],
        [ -2.75,  5.50,  1.50],
        [ -2.00,  1.50,  1.25]
    ])

My question is: how does each value in the covariance matrix affect the output? For example, if I want to build sample datasets that have only 2 variables, or more than 3 variables, how do I determine which values I can use in the covariance matrix?

"""Example of generating correlated normally distributed random samples."""

import numpy as np
from scipy.linalg import eigh, cholesky
from scipy.stats import norm

from pylab import plot, show, axis, subplot, xlabel, ylabel, grid


# Choice of cholesky or eigenvector method.
method = 'cholesky'
#method = 'eigenvectors'

num_samples = 400

# The desired covariance matrix.
r = np.array([
        [  3.40, -2.75, -2.00],
        [ -2.75,  5.50,  1.50],
        [ -2.00,  1.50,  1.25]
    ])

# Generate samples from three independent normally distributed random
# variables (with mean 0 and std. dev. 1).
x = norm.rvs(size=(3, num_samples))

# We need a matrix `c` for which `c*c^T = r`.  We can use, for example,
# the Cholesky decomposition, or we can construct `c` from the
# eigenvectors and eigenvalues.

if method == 'cholesky':
    # Compute the Cholesky decomposition.
    c = cholesky(r, lower=True)
else:
    # Compute the eigenvalues and eigenvectors.
    evals, evecs = eigh(r)
    # Construct c, so c*c^T = r.
    c = np.dot(evecs, np.diag(np.sqrt(evals)))

# Convert the data to correlated random variables. 
y = np.dot(c, x)

#
# Plot various projections of the samples.
#
subplot(2,2,1)
plot(y[0], y[1], 'b.')
ylabel('y[1]')
axis('equal')
grid(True)

subplot(2,2,3)
plot(y[0], y[2], 'b.')
xlabel('y[0]')
ylabel('y[2]')
axis('equal')
grid(True)

subplot(2,2,4)
plot(y[1], y[2], 'b.')
xlabel('y[1]')
axis('equal')
grid(True)

show()

Alxmrphi · Accepted Answer

how do I determine what values I can use in the covariance matrix?

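You don't 'determine' the values; within the rules for a valid covariance matrix, they are your choice. If you want to use 2 variables, then the covariance matrix will be (2,2) in shape. The diagonal entries are the variances of the individual variables, and each off-diagonal entry is the covariance between a pair of variables. If you want the first variable to be positively correlated with the second, put a positive value at the [1,2] index and, because the matrix must be symmetric, the same value at [2,1]. The matrix also has to be positive semi-definite (positive definite if you use the Cholesky method), which in the 2-variable case means the off-diagonal value cannot exceed the product of the two standard deviations in magnitude. I think you need to read up on covariance matrices in general to see how the values affect the output distribution; it's not a scipy question, per se. Beyond those constraints, you are in charge of the values, and they depend on how strongly you want the RVs to be correlated.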
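As a rough illustration of the point above, here is a minimal sketch that builds a 2-variable covariance matrix from standard deviations and a correlation, checks that it is usable, and confirms that the generated samples reproduce it. The helper names `is_valid_covariance` and `covariance_from_correlation` are hypothetical, made up for this example and not part of NumPy or SciPy, and the specific numbers (standard deviations 2 and 1, correlation 0.8) are arbitrary choices.

```python
import numpy as np


def is_valid_covariance(cov, tol=1e-10):
    """Hypothetical helper: True if `cov` is square, symmetric and
    positive semi-definite (all eigenvalues >= 0, up to `tol`)."""
    cov = np.asarray(cov, dtype=float)
    if cov.ndim != 2 or cov.shape[0] != cov.shape[1]:
        return False
    if not np.allclose(cov, cov.T):
        return False
    # eigvalsh is intended for symmetric matrices.
    return bool(np.all(np.linalg.eigvalsh(cov) >= -tol))


def covariance_from_correlation(stds, corr):
    """Hypothetical helper: cov[i, j] = corr[i, j] * stds[i] * stds[j]."""
    stds = np.asarray(stds, dtype=float)
    corr = np.asarray(corr, dtype=float)
    return corr * np.outer(stds, stds)


# Two variables with std. devs. 2 and 1 and correlation 0.8 (arbitrary values).
cov2 = covariance_from_correlation([2.0, 1.0], [[1.0, 0.8], [0.8, 1.0]])
# cov2 is [[4.0, 1.6], [1.6, 1.0]]: variances on the diagonal,
# covariance 0.8 * 2 * 1 = 1.6 off the diagonal.

# Same recipe as in the question: independent standard normals, then
# multiply by a matrix c with c @ c.T == cov2.
rng = np.random.default_rng(0)
x = rng.standard_normal(size=(2, 100_000))
c = np.linalg.cholesky(cov2)  # lower-triangular factor
y = c @ x

# The sample covariance of y should be close to cov2.
sample_cov = np.cov(y)

# A symmetric matrix that is NOT a valid covariance matrix: the implied
# correlation would be 2 / (1 * 1) = 2, which is impossible.
bad = np.array([[1.0, 2.0], [2.0, 1.0]])

# The 3x3 matrix from the question, for comparison.
r = np.array([
    [3.40, -2.75, -2.00],
    [-2.75, 5.50, 1.50],
    [-2.00, 1.50, 1.25],
])

print(is_valid_covariance(cov2), is_valid_covariance(bad), is_valid_covariance(r))
print(sample_cov)
```

The same approach extends to more than 3 variables: choose a standard deviation for each variable and a correlation between -1 and 1 for each pair, then check the result, since not every set of pairwise correlations yields a positive semi-definite matrix.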
