Mallika

Reputation: 55

Error in R: subscript out of bounds

So I'm trying to do something very simple: loop over a data frame and find the maximum correlation coefficient between any pair of columns.

I am trying to do this in R.

My data frame has been read using fread()

Here's my code. I declared max=-1, a=0 and b=0 at the start.

for(i in 2:1933)
{
    for(j in i+1:1934)
    {
        if(is.numeric(data[[i]]) && is.numeric(data[[j]]))
        {
            if(isTRUE(sd(data[[i]], na.rm=TRUE) !=0) && isTRUE(sd(data[[j]], na.rm=TRUE) !=0))
            {
                c = cor(data[[i]], data[[j]], use="pairwise.complete.obs")
                if(isTRUE(c>=max))
                {
                    max = c
                    a = i
                    b = j
                }
            }
        }
    }
}

The error I get is

Error in .subset2(x, i, exact = exact) : subscript out of bounds

I do have 1934 columns, so I can't figure out the problem. Am I missing something fairly obvious?
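
One likely cause of the error is operator precedence: in R, `:` binds more tightly than `+`, so `i+1:1934` is parsed as `i + (1:1934)` and `j` runs past the last column. A minimal sketch of the difference, using a small hypothetical column count `n` in place of 1934:

```r
i <- 2
n <- 5                # hypothetical column count, standing in for 1934

bad  <- i + 1:n       # parsed as i + (1:n), so it overshoots n
good <- (i + 1):n     # parentheses give the intended range

print(bad)            # 3 4 5 6 7
print(good)           # 3 4 5
```

Under that reading, writing the inner loop as `for(j in (i+1):1934)` should keep `j` within the 1934 columns.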

Upvotes: 1

Views: 15169

Answers (2)

jlhoward

Reputation: 59335

There's a much easier way to do this: cor(...) takes a matrix (nr X nc) and returns a new matrix (nc X nc) with the correlation coefficient of every column against every other column. The rest is pretty straightforward:

library(data.table)   # to simulate fread(...)
set.seed(1)           # for reproducible example
dt <- as.data.table(matrix(1:50+rnorm(50,sd=5), ncol=5)) # create reproducible example


result <- cor(dt, use="pairwise.complete.obs")       # matrix of correlation coefficients
diag(result) <- NA                                   # set diagonals to NA
max(result, na.rm=TRUE)                              # maximum correlation coefficient
# [1] 0.7165304
which(result==max(result, na.rm=TRUE), arr.ind=TRUE) # location of max
#    row col
# V3   3   2
# V2   2   3

There are two locations because the correlation matrix is symmetric: the correlation between columns 2 and 3 is the same as the correlation between columns 3 and 2.
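
If only one location per pair is wanted, one option is to blank out the diagonal and the lower triangle before searching. This is a minimal sketch, assuming a plain numeric matrix built the same way as `dt` above; the names `m` and `best` are only for illustration:

```r
set.seed(1)
# Assumed stand-in for the data: same construction as dt, kept as a matrix
m <- matrix(1:50 + rnorm(50, sd = 5), ncol = 5)

result <- cor(m, use = "pairwise.complete.obs")  # 5 x 5 correlation matrix
result[!upper.tri(result)] <- NA                 # keep only entries with row < col

# Location of the largest remaining coefficient, reported once per pair
best <- which(result == max(result, na.rm = TRUE), arr.ind = TRUE)
print(best)
```

Here `best` has a single row as long as the maximum is not tied, with the row index smaller than the column index.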

Upvotes: 2

sairaamv

Reputation: 86

Try this:

drop_list <- NULL

# Assume the first column is an ID column and skip it
feature.names <- names(data)[-1]

for (f in feature.names) {
  # Check the type first so sd() is never called on a non-numeric column;
  # isTRUE() also drops columns whose sd is NA
  if (!is.numeric(data[[f]]) || !isTRUE(sd(data[[f]], na.rm = TRUE) != 0)) {
    drop_list <- c(drop_list, f)
  }
}

# Keep only the usable feature columns. as.data.frame() makes the column
# selection behave the same for a data.table returned by fread()
data <- as.data.frame(data)[, setdiff(feature.names, drop_list), drop = FALSE]

corr_data <- cor(data, use = "pairwise.complete.obs")

## Remove the correlation of each variable with itself
diag(corr_data) <- -99

# Then use which(), as jlhoward suggested, to locate the largest correlation

Cheers

Upvotes: 0
