Reputation: 2179
I am following a tutorial about covariance matrices that could be found here: http://stats.seandolinar.com/making-a-covariance-matrix-in-r/
It includes the following steps:
#create a dataframe
a <- c(1,2,3,4,5,6)
b <- c(2,3,5,6,1,9)
c <- c(3,5,5,5,10,8)
d <- c(10,20,30,40,50,55)
e <- c(7,8,9,4,6,10)
#create matrix from vectors
M <- cbind(a,b,c,d,e)
k <- ncol(M) #number of variables
n <- nrow(M) #number of subjects
M_mean <- matrix(data=1, nrow=n) %*% cbind(mean(a),mean(b),mean(c),mean(d),mean(e))
And then creating a difference matrix like this:
D <- M - M_mean
This is all pretty straightforward to me. But the next step does this to create a covariance matrix:
C <- (n-1)^-1 * t(D) %*% D
I get that the part t(D) %*% D is divided by (n - 1) = 5. But I do not get how exactly t(D) %*% D is built up.
Could anybody explain this to me?
Upvotes: 2
Views: 5899
Reputation: 73265
But I do not get how exactly t(D) %*% D is built up.
This is the matrix cross product, a special form of matrix multiplication. If you don't understand what it is doing, consider the following R loop to help you absorb this:
DtD <- matrix(0, nrow = ncol(D), ncol = ncol(D))
for (j in 1:ncol(D)) {
  for (i in 1:ncol(D)) {
    DtD[i, j] <- sum(D[, i] * D[, j])
  }
}
Note, nobody is actually going to write an R loop for this; this is just to help you understand the algorithm.
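As a quick sanity check, the loop should agree with R's built-in cross product. Here is a sketch using a small two-column version of the question's data (the column choice is just illustrative):

```r
## two columns from the question's data, centered by column means
M <- cbind(a = c(1, 2, 3, 4, 5, 6), b = c(2, 3, 5, 6, 1, 9))
D <- sweep(M, 2L, colMeans(M))

## the loop version from above
DtD <- matrix(0, nrow = ncol(D), ncol = ncol(D))
for (j in 1:ncol(D)) {
  for (i in 1:ncol(D)) {
    DtD[i, j] <- sum(D[, i] * D[, j])
  }
}

## both should match the explicit t(D) %*% D
all.equal(DtD, crossprod(D), check.attributes = FALSE)  ## TRUE
all.equal(DtD, t(D) %*% D, check.attributes = FALSE)    ## TRUE
```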
Original Answer
Suppose we have a matrix X, where each column gives observations for a specific random variable. Normally we just use the R base function cov(X) to get the covariance matrix.
Now you want to write a covariance function yourself; that is also not difficult (I did this a long time ago as an exercise). It takes 3 steps:

1. center each column of X by subtracting its column mean;
2. take the matrix cross product of the centered matrix with itself;
3. divide by nrow(X) - 1, not nrow(X), for bias adjustment.

This short line of code does it:
crossprod(sweep(X, 2L, colMeans(X))) / (nrow(X) - 1L)
Consider a small example
set.seed(0)
## 3 variables, each with 10 observations
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
## reference computation by `cov`
cov(X)
# [,1] [,2] [,3]
#[1,] 1.4528358 -0.20093966 -0.10432388
#[2,] -0.2009397 0.46086672 -0.05828058
#[3,] -0.1043239 -0.05828058 0.48606879
## own implementation
crossprod(sweep(X, 2L, colMeans(X))) / (nrow(X) - 1L)
# [,1] [,2] [,3]
#[1,] 1.4528358 -0.20093966 -0.10432388
#[2,] -0.2009397 0.46086672 -0.05828058
#[3,] -0.1043239 -0.05828058 0.48606879
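The one-liner can also be unpacked into the 3 steps, using the same X as above (the intermediate names Xc, S and V are just illustrative):

```r
set.seed(0)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)

## step 1: center each column; column means become numerically zero
Xc <- sweep(X, 2L, colMeans(X))
colMeans(Xc)  ## all (numerically) zero

## step 2: matrix cross product t(Xc) %*% Xc
S <- crossprod(Xc)

## step 3: divide by n - 1 for the unbiased estimate
V <- S / (nrow(X) - 1L)
all.equal(V, cov(X))  ## TRUE
```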
What if you want to get the correlation matrix?
There are many ways. If we want to get it directly, do:
crossprod(scale(X)) / (nrow(X) - 1L)
# [,1] [,2] [,3]
#[1,] 1.0000000 -0.2455668 -0.1241443
#[2,] -0.2455668 1.0000000 -0.1231367
#[3,] -0.1241443 -0.1231367 1.0000000
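This works because scale(X) centers each column and divides it by its sample standard deviation, so the cross product of the scaled matrix is (n - 1) times the correlation matrix. A sketch of the equivalence (Z is just an illustrative name for the manually standardized matrix):

```r
set.seed(0)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)

## manual standardization: center, then divide by column sd
Z <- sweep(sweep(X, 2L, colMeans(X)), 2L, apply(X, 2L, sd), "/")
all.equal(crossprod(Z) / (nrow(X) - 1L), cor(X))  ## TRUE

## scale() does the same in one call
all.equal(Z, scale(X), check.attributes = FALSE)  ## TRUE
```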
If we want to first get the covariance matrix, then (symmetrically) rescale it by the square root of its diagonal to get the correlation matrix, we can do:
## covariance first
V <- crossprod(sweep(X, 2L, colMeans(X))) / (nrow(X) - 1L)
## symmetric rescaling
V / tcrossprod(diag(V) ^ 0.5)
# [,1] [,2] [,3]
#[1,] 1.0000000 -0.2455668 -0.1241443
#[2,] -0.2455668 1.0000000 -0.1231367
#[3,] -0.1241443 -0.1231367 1.0000000
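Here tcrossprod(diag(V) ^ 0.5) is just the outer product of the column standard deviations, i.e. its (i, j) entry is sd_i * sd_j. A sketch of that equivalence (s is an illustrative name):

```r
set.seed(0)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
V <- cov(X)

s <- sqrt(diag(V))                      ## column standard deviations
all.equal(tcrossprod(s), outer(s, s))   ## TRUE
all.equal(V / outer(s, s), cov2cor(V))  ## TRUE
```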
We can also use the utility R function cov2cor to convert a covariance matrix to a correlation matrix:
cov2cor(V)
# [,1] [,2] [,3]
#[1,] 1.0000000 -0.2455668 -0.1241443
#[2,] -0.2455668 1.0000000 -0.1231367
#[3,] -0.1241443 -0.1231367 1.0000000
Upvotes: 3