Reputation: 625
There are some highly correlated columns in my data.frame. Is there an efficient way to exclude them, for example by matrix manipulation on the correlation matrix? Here is some sample code:
a=c(1,2,3,4,5)
df=data.frame(a=a,b=a*2,c=c(2,1,1,2,9),d=a*3)
round(cor(df,df),6)
output:
         a        b        c        d
a 1.000000 1.000000 0.699379 1.000000
b 1.000000 1.000000 0.699379 1.000000
c 0.699379 0.699379 1.000000 0.699379
d 1.000000 1.000000 0.699379 1.000000
Ideally b and d would be excluded, because their correlation with a is 1.
Upvotes: 1
Views: 708
Reputation: 2998
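# Greedily remove vertices from the correlation graph until no edges remain.
# adj.m is a 0/1 adjacency matrix; returns the indices of the removed vertices.
discretize = function(adj.m) {
  drops = c()
  # valence = number of edges touching each vertex (row plus column counts)
  valence = rowSums(adj.m) + colSums(adj.m)
  max.valence = max(valence)
  while (max.valence > 0) {
    # drop the first vertex with the most remaining edges
    drop.vertex = which(valence == max.valence)[1]
    drops = append(drops, drop.vertex)
    # remove all edges of the dropped vertex
    adj.m[drop.vertex, ] = 0
    adj.m[, drop.vertex] = 0
    valence = rowSums(adj.m) + colSums(adj.m)
    max.valence = max(valence)
  }
  drops
}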
m <- as.data.frame(df)
cor.threshold = ???? # Set this yourself
nvi <- as.double(rep(0,ncol(m)))
names(nvi) <- names(m)
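# perform pairwise correlation dropping
corm = abs(cor(m))
# zero the lower triangle and the diagonal so each pair is counted once
corm[lower.tri(corm, diag = TRUE)] = 0
cor.indices = which(corm > cor.threshold, arr.ind = TRUE)
# adjacency matrix: 1 where a pair of columns exceeds the threshold
adjm = matrix(0, nrow = nrow(corm), ncol = ncol(corm))
adjm[cor.indices] = 1
drop.idx = discretize(adjm)
nvi[drop.idx] = 1
# drop = FALSE keeps m a data.frame even if only one column survives
m <- m[, !colnames(m) %in% names(which(nvi == 1)), drop = FALSE]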
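Note that the greedy approach above removes the most-connected column first, so on the sample data it may not keep a. If the goal is to keep the first column of each correlated group, as the question asks, a simpler variant could look like the sketch below. This is not the method above but a "keep first" pass over the upper triangle; the cutoff of 0.99 and the names keep and reduced are assumptions chosen for illustration.

```r
# Sample data from the question.
a  <- c(1, 2, 3, 4, 5)
df <- data.frame(a = a, b = a * 2, c = c(2, 1, 1, 2, 9), d = a * 3)

# Assumed cutoff, for illustration only.
cor.threshold <- 0.99

# Absolute correlations; zero the lower triangle and diagonal so that
# column j only holds its correlations with earlier columns.
corm <- abs(cor(df))
corm[lower.tri(corm, diag = TRUE)] <- 0

# Walk the columns left to right and drop a column if it is too strongly
# correlated with any earlier column that is still kept.
keep <- rep(TRUE, ncol(df))
for (j in seq_len(ncol(df))) {
  if (any(corm[keep, j] > cor.threshold)) keep[j] <- FALSE
}

# drop = FALSE keeps the result a data.frame even with one column left.
reduced <- df[, keep, drop = FALSE]
print(names(reduced))  # "a" "c"
```

With this ordering b and d are removed because each is perfectly correlated with a, while c stays because its correlation with a (about 0.70) is below the assumed cutoff.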
Upvotes: 1