Reputation: 81
I am attempting to calculate the correlation between all pairs of rows of a large data frame, and so far I have come up with a simple for-loop that works. For example:
name <- c("a", "b", "c", "d")
col1 <- c(43.78, 43.84, 37.92, 31.72)
col2 <- c(43.80, 43.40, 37.64, 31.62)
col3 <- c(43.14, 42.85, 37.54, 31.74)
df <- data.frame(name, col1, col2, col3)
cor.df <- data.frame(name1 = NA, name2 = NA, correl = NA)
for (i in 1:(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    v1 <- as.numeric(df[i, 2:ncol(df)])
    v2 <- as.numeric(df[j, 2:ncol(df)])
    correl <- cor(v1, v2)
    name1 <- df[i, "name"]
    name2 <- df[j, "name"]
    dftemp <- data.frame(name1, name2, correl)
    cor.df <- rbind(cor.df, dftemp)
  }
}
na.omit(cor.df)
# name1 name2 correl
# a b 0.8841255
# a c 0.6842705
# a d -0.6491118
# b c 0.9457125
# b d -0.2184630
# c d 0.1105508
Given the large data frame and the inefficient for-loop, the correlation computation takes a long time. Does anyone have suggestions for making it faster? Note that I have many data frames in a list, so I could use lapply (but I have not figured out how to write that line of code).
Upvotes: 7
Views: 8870
Reputation: 6325
Drop the first column, transpose, and use base::cor (cor correlates the columns of its input, so transposing turns each original row into a column):
> cor(t(df[-1]))
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.8841255 0.6842705 -0.6491118
[2,] 0.8841255 1.0000000 0.9457125 -0.2184630
[3,] 0.6842705 0.9457125 1.0000000 0.1105508
[4,] -0.6491118 -0.2184630 0.1105508 1.0000000
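As a quick check that this matches the loop in the question, one entry of the matrix can be compared against the corresponding pairwise call (using the same df as above):
v1 <- as.numeric(df[1, -1])   # row "a"
v2 <- as.numeric(df[2, -1])   # row "b"
all.equal(cor(v1, v2), cor(t(df[-1]))[1, 2])
# [1] TRUE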
# pretty output
x <- cor(t(df[, -1]))
x[upper.tri(x, diag = TRUE)] <- NA
rownames(x) <- colnames(x) <- df$name
x <- na.omit(reshape::melt(t(x)))
x <- x[order(x$X1, x$X2), ]
x
# X1 X2 value
# 5 a b 0.8841255
# 9 a c 0.6842705
# 13 a d -0.6491118
# 10 b c 0.9457125
# 14 b d -0.2184630
# 15 c d 0.1105508
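For the list of data frames mentioned in the question, the same idea drops straight into lapply. A minimal sketch, assuming every data frame has the same layout as df (a name column followed by numeric columns); df.list is a hypothetical name standing in for your list:
df.list <- list(df, df)  # placeholder list of data frames shaped like df
cor.list <- lapply(df.list, function(d) {
  x <- cor(t(d[, -1]))                  # row-by-row correlation matrix
  rownames(x) <- colnames(x) <- d$name  # label with the name column
  x
})
Each element of cor.list is then a full correlation matrix that can be melted the same way as above.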
Upvotes: 8