Reputation: 55
So I'm trying to do something very simple: loop over a data frame and calculate the maximum correlation coefficient between any pair of columns.
I am trying to do this in R.
My data frame has been read using fread().
Here's my code. I declared max=-1, a=0 and b=0 at the start.
for(i in 2:1933)
{
  for(j in i+1:1934)
  {
    if(is.numeric(data[[i]]) && is.numeric(data[[j]]))
    {
      if(isTRUE(sd(data[[i]], na.rm=TRUE) !=0) && isTRUE(sd(data[[j]], na.rm=TRUE) !=0))
      {
        c = cor(data[[i]], data[[j]], use="pairwise.complete.obs")
        if(isTRUE(c>=max))
        {
          max = c
          a = i
          b = j
        }
      }
    }
  }
}
The error I get is
Error in .subset2(x, i, exact = exact) : subscript out of bounds
I do have 1934 columns, so I can't figure out the problem. Am I missing something fairly obvious?
Upvotes: 1
Views: 15169
Reputation: 59335
There's a much easier way to do this: cor(...) takes a matrix (nr x nc) and returns a new matrix (nc x nc) with the correlation coefficient of every column against every other column. The rest is pretty straightforward:
library(data.table) # to simulate fread(...)
set.seed(1) # for reproducible example
dt <- as.data.table(matrix(1:50+rnorm(50,sd=5), ncol=5)) # create reproducible example
result <- cor(dt, use="pairwise.complete.obs") # matrix of correlation coefficients
diag(result) <- NA # set diagonals to NA
max(result, na.rm=TRUE) # maximum correlation coefficient
# [1] 0.7165304
which(result==max(result, na.rm=TRUE), arr.ind=TRUE) # location of max
# row col
# V3 3 2
# V2 2 3
There are two locations because, of course, the correlation between columns 2 and 3 is the same as the correlation between columns 3 and 2.
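As for the "subscript out of bounds" error in the question's loop, a likely cause is operator precedence: in R, `:` binds more tightly than `+`, so `i+1:1934` is read as `i + (1:1934)` and `j` runs past the last column. The snippet below is only a small illustration with made-up values (`i` and `n` are placeholders, not the real column count):

```r
i <- 3
n <- 6  # hypothetical number of columns, kept small for illustration

# `:` is evaluated before `+`, so this is i + (1:n)
wrong <- i + 1:n
print(wrong)   # values run up to i + n, beyond the n columns available

# Parenthesising the start of the range gives the intended sequence
right <- (i + 1):n
print(right)   # values run from i + 1 up to n
```

Under that reading, writing the inner loop as `for(j in (i+1):1934)` should keep `j` within the 1934 columns.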
Upvotes: 2
Reputation: 86
Try this:
drop_list <- NULL
# Assume the first column is an ID column
feature.names <- names(data)[2:length(names(data))]
for(f in feature.names){
  # Check is.numeric() first so sd() is never called on a non-numeric column
  if(!is.numeric(data[[f]]) || !isTRUE(sd(data[[f]], na.rm=TRUE) > 0))
  {
    drop_list <- c(drop_list, f)
  }
}
# Keep only the usable feature columns (this also leaves out the ID column)
data <- as.data.frame(data)[, setdiff(feature.names, drop_list), drop=FALSE]
corr_data <- cor(data, use="pairwise.complete.obs")
## Overwrite the correlation of each variable with itself using a sentinel value
diag(corr_data) <- -99
# To locate the maximum, apply which() to the correlation matrix, as Howard suggested
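The final step, locating the maximum with which(), could look roughly like the sketch below. It uses made-up data (the matrix `m` and its column names are placeholders, not the asker's data set) and blanks the lower triangle as well as the diagonal, so each pair of columns is reported only once:

```r
set.seed(1)
# Hypothetical stand-in data: 10 rows, 5 numeric columns named V1..V5
m <- matrix(rnorm(50), ncol = 5, dimnames = list(NULL, paste0("V", 1:5)))
corr <- cor(m, use = "pairwise.complete.obs")

# Keep only the strict upper triangle so every pair appears exactly once
corr[!upper.tri(corr)] <- NA

# Largest remaining coefficient and its position in the matrix
best <- max(corr, na.rm = TRUE)
pos <- which(corr == best, arr.ind = TRUE)

# Names of the two columns that form the most correlated pair
pair <- c(rownames(corr)[pos[1, "row"]], colnames(corr)[pos[1, "col"]])
print(best)
print(pair)
```

This looks for the largest signed coefficient; if strong negative correlations should count too, the comparison would have to be made on `abs(corr)` instead.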
Cheers
Upvotes: 0