user5946647

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column is treated as the new column 1, and the column that was most correlated with it is removed from the set, so the data is reduced by one column. I then repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. Is there an efficient or quicker way to get to the result I'm going for? I've tried various things, but without success so far. Any suggestions?

n <- 5
r <- 6


> df
    X1   X2   X3   X4   X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60

This is what result should look like:

> result
    X1   X2   X3   X4   X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
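
For instance, in this example result$X2 equals df$X1 + df$X5 (0.32 + 0.18 = 0.50, 0.52 + 0.65 = 1.17, and so on), so X5 was the column most correlated with X1 in the first step.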

Upvotes: 1

Views: 96

Answers (2)

coffeinjunky

Reputation: 11514

Try the following (the setup lines are restated so the snippet runs on its own):

result <- data.frame(matrix(NA, nrow = r, ncol = n)) # holds the aggregated columns
temp <- df                                           # working copy used for the correlations
result[, 1] <- temp[, 1]                             # the first column stays as is

for (i in 2:n) {
  # drop=FALSE keeps temp a data frame even when only one column is left
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor] # aggregate the two columns
  temp[, 1] <- result[, i]                  # set the aggregate as the new 1st column
  temp[, maxcor] <- NULL                    # remove the most correlated column
}

The error was caused because, in the last iteration, subsetting temp with [, -1] yields a single column, and standard R behavior is to drop the result from data frame to vector in that case; sapply then iterates over the individual elements of that vector rather than over columns. Adding drop=FALSE prevents this.
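
To see that behavior in isolation (a minimal illustration, not part of the original code):

d <- data.frame(a = 1:3, b = 4:6)
class(d[, -1])                # "integer": the single remaining column drops to a vector
class(d[, -1, drop = FALSE])  # "data.frame": drop=FALSE keeps the structure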

One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
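
If you do want the strongest correlation regardless of sign, a small tweak to the line above suffices (a sketch, simply wrapping the correlation in abs()):

maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) abs(cor(temp[, 1], x)))))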


To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance, the line

   mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])

contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with

  cortemp <- cor(temp)
  mch <- match(c(max(cortemp[-1,1])),cortemp[,1])

should cut the computational burden of the initial code line in half.
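
Going one step further, a single which.max over the cached correlations avoids the match() call as well (a sketch; the + 1L compensates for dropping the first element):

cortemp <- cor(temp)[, 1]           # correlations of every column with column 1
mch <- which.max(cortemp[-1]) + 1L  # index in temp of the most correlated column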

Upvotes: 0

shekeine

Reputation: 1465

I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though those could be improved too, as @coffeinjunky says). This is most likely a result of the way data.frames are modified in R: each change inside the loop makes R copy the object. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. Below is your code translated into data.table syntax; you can time the two loops, compare performance, and comment with the results. Cheers.

library(data.table) # for setDT(), copy() and set()

n <- 5L
r <- 6L

result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- copy(df)                       # temporary table in which the correlations are calculated
set(result, j = 1L, value = temp[[1]]) # the first column is the same

for (icol in 2:n) {
  mch <- match(c(max(cor(temp)[-1, 1])), cor(temp)[, 1])               # most correlated column
  set(x = result, i = NULL, j = icol, value = temp[[1]] + temp[[mch]]) # aggregate into result
  set(x = temp, i = NULL, j = 1L, value = result[[icol]])              # set result as new 1st column
  set(x = temp, i = NULL, j = mch, value = NULL)                       # remove the column by reference
}
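
For the timing comparison, wrapping each version in system.time() is the quickest check (a sketch; you would need a larger random df than the 6 x 5 example above to see any real difference):

df <- data.frame(matrix(runif(1000 * 200), nrow = 1000, ncol = 200))
n <- ncol(df)
r <- nrow(df)
system.time({
  # data.frame loop from the first answer goes here
})
system.time({
  # data.table loop from above goes here
})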

Upvotes: 2
