user5946647

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is most correlated with column 1, and then aggregate these two columns. The aggregated column is treated as the new column 1, and the column that was most correlated with it is removed from the set, so the data is reduced by one column. I then repeat the process until the data frame result has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. Is there an efficient or quicker way to get to the result I'm going for? I've tried various things, but without success so far. Any suggestions?

n <- 5
r <- 6


> df
    X1   X2   X3   X4   X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60

This is what result should look like:

> result
    X1   X2   X3   X4   X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
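
For instance, in this example result$X2 equals df$X1 + df$X5 (0.32 + 0.18 = 0.50, 0.52 + 0.65 = 1.17, and so on), so X5 was the column most correlated with X1 in the first step.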

Upvotes: 1

Views: 96

Answers (2)

coffeinjunky

Reputation: 11514

Try the following (the setup lines are restated so the snippet runs on its own):

result <- data.frame(matrix(NA, nrow = r, ncol = n)) # holds the aggregated columns
temp <- df                                           # working copy used for the correlations
result[, 1] <- temp[, 1]                             # the first column stays as is

for (i in 2:n) {
  # drop=FALSE keeps temp a data frame even when only one column is left
  maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) cor(temp[, 1], x))))
  result[, i] <- temp[, 1] + temp[, maxcor] # aggregate the two columns
  temp[, 1] <- result[, i]                  # set the aggregate as the new 1st column
  temp[, maxcor] <- NULL                    # remove the most correlated column
}

The error was caused because, in the last iteration, subsetting temp with [, -1] yields a single column, and standard R behavior is to drop the result from data frame to vector in that case; sapply then iterates over the individual elements of that vector rather than over columns. Adding drop=FALSE prevents this.
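
To see that behavior in isolation (a minimal illustration, not part of the original code):

d <- data.frame(a = 1:3, b = 4:6)
class(d[, -1])                # "integer": the single remaining column drops to a vector
class(d[, -1, drop = FALSE])  # "data.frame": drop=FALSE keeps the structure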

One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
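
If you do want the strongest correlation regardless of sign, a small tweak to the line above suffices (a sketch, simply wrapping the correlation in abs()):

maxcor <- names(which.max(sapply(temp[, -1, drop = FALSE], function(x) abs(cor(temp[, 1], x)))))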


To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance, the line

   mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])

contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with

  cortemp <- cor(temp)
  mch <- match(c(max(cortemp[-1,1])),cortemp[,1])

should cut the computational burden of the initial code line in half.
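
Going one step further, a single which.max over the cached correlations avoids the match() call as well (a sketch; the + 1L compensates for dropping the first element):

cortemp <- cor(temp)[, 1]           # correlations of every column with column 1
mch <- which.max(cortemp[-1]) + 1L  # index in temp of the most correlated column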

Upvotes: 0

shekeine

Reputation: 1465

I think most of the slowness and the eventual crash come from memory overhead during the loop, not from the correlations (though those could be improved too, as @coffeinjunky says). This is most likely a result of the way data.frames are modified in R: each change inside the loop makes R copy the object. Consider switching to data.tables and taking advantage of their "assignment by reference" paradigm. Below is your code translated into data.table syntax; you can time the two loops, compare performance, and comment with the results. Cheers.

library(data.table) # for setDT(), copy() and set()

n <- 5L
r <- 6L

result <- setDT(data.frame(matrix(NA, nrow = r, ncol = n)))
temp <- copy(df)                       # temporary table in which the correlations are calculated
set(result, j = 1L, value = temp[[1]]) # the first column is the same

for (icol in 2:n) {
  mch <- match(c(max(cor(temp)[-1, 1])), cor(temp)[, 1])               # most correlated column
  set(x = result, i = NULL, j = icol, value = temp[[1]] + temp[[mch]]) # aggregate into result
  set(x = temp, i = NULL, j = 1L, value = result[[icol]])              # set result as new 1st column
  set(x = temp, i = NULL, j = mch, value = NULL)                       # remove the column by reference
}
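
For the timing comparison, wrapping each version in system.time() is the quickest check (a sketch; you would need a larger random df than the 6 x 5 example above to see any real difference):

df <- data.frame(matrix(runif(1000 * 200), nrow = 1000, ncol = 200))
n <- ncol(df)
r <- nrow(df)
system.time({
  # data.frame loop from the first answer goes here
})
system.time({
  # data.table loop from above goes here
})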

Upvotes: 2
