Reputation: 1163
I would like to compute correlations in R. However, I have a lot of missing values, so I would like to keep in the correlation matrix only those correlations that were calculated from at least 10 pairs of values. How should I proceed?
Edit: please note that the correlation matrix is computed between two large matrices, X and Y, which have the same individuals (rows).
Upvotes: 4
Views: 2961
Reputation: 60462
First we generate some example data:
R> x = matrix(rnorm(100), ncol=5)
##Fill in some NA's
R> x[3:15,1] = NA
R> x[2:10,3] = NA
Next we loop through the columns of the x matrix, doing a comparison to detect NA's:
## Create a matrix whose elements are the
## maximum number of possible comparisons
m = matrix(nrow(x), ncol=ncol(x),nrow=ncol(x))
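## This comparison can be made more efficient:
## we only need to compare column i with columns (i+1):ncol(x).
## After the loop, m[i, j] is the number of rows where
## both column i and column j are non-missing.
for(i in 1:ncol(x)) {
  ## TRUE wherever column i or the compared column is NA
  detect_na = is.na(x[,i]==x)
  c_sums = colSums(detect_na)
  m[i,] = m[i,] - c_sums
}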
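As the comment in the loop notes, the counting can be done more efficiently. One possible vectorised sketch is to take the cross-product of a 0/1 "observed" indicator matrix. The seed and the names `ok` and `n` are assumptions made for illustration and are not part of the original answer:

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
x <- matrix(rnorm(100), ncol = 5)
x[3:15, 1] <- NA
x[2:10, 3] <- NA

# 1 where a value is observed, 0 where it is NA
ok <- 1 - is.na(x)

# n[i, j] = number of rows where both column i and column j are observed
n <- crossprod(ok)
n
```

Here `n` should hold the same counts as the matrix `m` built by the loop, without looping over columns.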
The matrix m now contains the number of complete pairs for each pair of columns. Now convert m into a mask in preparation for subsetting, keeping pairs with at least 10 observations:
m = ifelse(m>=10, TRUE, NA)
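Next we work out the correlation for all column pairs and subset according to m. Using use = "pairwise.complete.obs" bases each correlation on exactly the pairs counted above, whereas "complete.obs" would drop every row that contains any NA. The values printed depend on the random data:
R> matrix(cor(x, use = "pairwise.complete.obs")[m], ncol=ncol(m), nrow=nrow(m))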
     [,1]    [,2]     [,3]    [,4]    [,5]
[1,]   NA      NA       NA      NA      NA
[2,]   NA  1.0000 -0.14302 0.35902 -0.3466
[3,]   NA -0.1430  1.00000 0.03949  0.6172
[4,]   NA  0.3590  0.03949 1.00000  0.1606
[5,]   NA -0.3466  0.61720 0.16061  1.0000
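For the case in the question's edit, where the correlations are between the columns of two matrices with the same rows, the same idea might be sketched as follows. The example data, the seed, and the names `min_pairs`, `r` and `n` are assumptions for illustration only:

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
X <- matrix(rnorm(100), ncol = 5)
Y <- matrix(rnorm(60), ncol = 3)
X[3:15, 1] <- NA
Y[2:10, 2] <- NA

min_pairs <- 10  # threshold taken from the question

# Each entry uses only the rows where both columns are observed
r <- cor(X, Y, use = "pairwise.complete.obs")

# n[i, j] = number of rows observed in both X[, i] and Y[, j]
n <- crossprod(1 - is.na(X), 1 - is.na(Y))

# Blank out correlations based on fewer than min_pairs pairs
r[n < min_pairs] <- NA
r
```

Assigning NA through the logical mask `n < min_pairs` keeps the dimensions of `r`, so there is no need to rebuild the matrix afterwards.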
Upvotes: 1