YYY

Reputation: 625

How to exclude the columns with high correlation?

There are some highly correlated columns in my data.frame. Is there an efficient way to exclude them, such as some matrix manipulation on the correlation matrix? Here is the sample code:

a=c(1,2,3,4,5)
df=data.frame(a=a,b=a*2,c=c(2,1,1,2,9),d=a*3)
round(cor(df,df),6)

output:

         a        b        c        d
a 1.000000 1.000000 0.699379 1.000000
b 1.000000 1.000000 0.699379 1.000000
c 0.699379 0.699379 1.000000 0.699379
d 1.000000 1.000000 0.699379 1.000000

Ideally b and d would be excluded, because their correlation with a is 1.

Upvotes: 1

Views: 708

Answers (1)

Andrew Cassidy

Reputation: 2998

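# Greedy vertex removal: adj.m is a 0/1 adjacency matrix in which a 1 marks a
# pair of columns whose correlation exceeds the threshold.
discretize = function(adj.m) {
 drops = c()
 # valence = number of above-threshold pairs each column takes part in
 valence = rowSums(adj.m) + colSums(adj.m)
 max.valence = max(valence)
 while(max.valence > 0) {
   # drop the first column with the most remaining links
   drop.vertex = which(valence == max.valence)[1]

   drops = append(drops, drop.vertex)
   adj.m[drop.vertex,] = 0
   adj.m[,drop.vertex] = 0

   # recompute valences now that this column's links are gone
   valence = rowSums(adj.m) + colSums(adj.m)
   max.valence = max(valence)
 }
 drops  # indices of the columns to remove
}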

m <- as.data.frame(df)
cor.threshold = ???? # Placeholder: set this yourself (a value between 0 and 1)
nvi <- as.double(rep(0,ncol(m)))
names(nvi) <- names(m)

# perform pairwise correlation dropping
corm=abs(cor(m))
corm[lower.tri(corm,diag=TRUE)]=0  # keep each pair once (upper triangle only)
cor.indices = which(corm > cor.threshold, arr.ind = TRUE)
adjm = matrix(0, nrow=nrow(corm), ncol=ncol(corm))
adjm[cor.indices] = 1
drop = discretize(adjm)
nvi[drop] = 1  # flag the columns chosen for removal
# drop = FALSE keeps m a data.frame even if only one column survives
m <- m[, !colnames(m) %in% names(which(nvi == 1)), drop = FALSE]
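
For a quick check against the question's data, the same greedy idea can be sketched as one self-contained function. This is only an illustration, not the code above verbatim: the name drop_correlated and the 0.9 cutoff are assumptions made for the example, and it counts links on the full symmetric matrix instead of the upper triangle, which should pick columns in the same order.

```r
# Sample data from the question.
a  <- c(1, 2, 3, 4, 5)
df <- data.frame(a = a, b = a * 2, c = c(2, 1, 1, 2, 9), d = a * 3)

# Hypothetical helper (name and default cutoff are assumptions): repeatedly
# drop the column involved in the most above-threshold pairs until no such
# pair remains.
drop_correlated <- function(data, threshold = 0.9) {
  adj <- abs(cor(data)) > threshold  # TRUE where a pair is too correlated
  adj[is.na(adj)] <- FALSE           # constant columns give NA; treat as unlinked
  diag(adj) <- FALSE                 # ignore each column's correlation with itself
  dropped <- character(0)
  while (any(adj)) {
    worst <- names(which.max(rowSums(adj)))  # first column with the most links
    dropped <- c(dropped, worst)
    adj[worst, ] <- FALSE            # remove that column's links
    adj[, worst] <- FALSE
  }
  data[, !names(data) %in% dropped, drop = FALSE]
}

reduced <- drop_correlated(df)
names(reduced)
# [1] "c" "d"
```

With this data a, b and d are all perfectly correlated, so two of the three have to go. Ties are broken by column order, which is why a and b are dropped and d is kept. If you would rather keep a, reorder the columns before calling the function.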

Upvotes: 1
