bshelt141
bshelt141

Reputation: 1223

Running aggregations across columns in R

I have the following data frame:

df <- data.frame(type = c("planes", "trains", "automobiles"), t1 = c(4, 5, 6), t2 = c(20, 60, 24), t3 = c(100, 120, 72), t4 = c(800, 360, 144))

df
         type t1 t2  t3  t4
1      planes  4 20 100 800
2      trains  5 60 120 360
3 automobiles  6 24  72 144

I'd now like to write a function that transform columns 3:5 to be running aggregations across each row (e.g., t2 = t1+t2 and t3 = t1+t2+t3), so that my new data frame looks like this:

new_df
         type  t1  t2  t3  t4
1      planes   4  24  124 924
2      trains   5  65  185 545
3 automobiles   6  30  102 246

Upvotes: 2

Views: 71

Answers (3)

Brandon Bertelsen
Brandon Bertelsen

Reputation: 44688

Just to add to @bgoldst even faster for the many column small amount of rows situation would be a for loop:

brandon <- function(df) {
  part <- as.matrix(df[2:ncol(df)])
  for(x in 1:nrow(df)) { 
    df[x,2:ncol(df)] <- cumsum((part[x,]))
    }
  return(df)
}

set.seed(1); R <- 4; C <- 1e4; df <- cbind(data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T)),matrix(sample(10,R*C,replace=T),R));

M <- 3; system.time({ replicate(M,josilber(df)); });
#   user  system elapsed 
#  10.96    0.00   10.95 
M <- 3; system.time({ replicate(M,bgoldst(df)); });
#   user  system elapsed 
# 14.056   0.000  14.040 
M <- 3; system.time({ replicate(M,brandon(df)); });
#   user  system elapsed 
#  0.400   0.000   0.401 
identical(bgoldst(df),brandon(df))
# [1] TRUE 

Not at all true for the many rows few columns case.

Upvotes: 2

bgoldst
bgoldst

Reputation: 35324

Here's a bit of an odd solution using Reduce():

df[-1] <- Reduce(function(a,b) cbind(a,b+a[,ncol(a)]),c(list(as.matrix(df[2])),df[-1:-2]));
df;
##          type t1 t2  t3  t4
## 1      planes  4 24 124 924
## 2      trains  5 65 185 545
## 3 automobiles  6 30 102 246

There's a significant performance benefit here for few-column-many-row data sets (which is more common than many-column-few-row, at least in my experience), because this solution will effectively iterate over columns rather than rows. Here's a demo with 10,000 rows, 4 columns:

set.seed(1); R <- 1e4; df <- data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T), t1=sample(10,R,replace=T), t2=sample(10,R,replace=T), t3=sample(10,R,replace=T), t4=sample(10,R,replace=T) );
bgoldst <- function(df) { df[-1] <- Reduce(function(a,b) cbind(a,b+a[,ncol(a)]),c(list(as.matrix(df[2])),df[-1:-2])); df; };
josilber <- function(df) { df[-1] <- t(apply(df[-1], 1, cumsum)); df; };
M <- 250; system.time({ replicate(M,josilber(df)); });
##    user  system elapsed
##  11.781   0.016  11.835
M <- 250; system.time({ replicate(M,bgoldst(df)); });
##    user  system elapsed
##   1.187   0.000   1.191
identical(bgoldst(df),josilber(df));
## [1] TRUE

Although to be fair, here's a performance test with a many-column-few-row data set, where @josilber's apply() solution outperforms my own, although by a lesser margin. This uses 4 rows and 10,000 columns, to mirror my first performance test (but had to use a smaller M, actually M=3, because columns seem to be more CPU-demanding than rows, regardless of the solution):

set.seed(1); R <- 4; C <- 1e4; df <- cbind(data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T)),matrix(sample(10,R*C,replace=T),R));
M <- 3; system.time({ replicate(M,josilber(df)); });
##    user  system elapsed
##  10.188   0.000  10.181
M <- 3; system.time({ replicate(M,bgoldst(df)); });
##    user  system elapsed
##  13.484   0.000  13.492
identical(bgoldst(df),josilber(df));
## [1] TRUE

And a minor point I just noticed, my solution correctly handles the degenerate case of just 1 addend column, whereas the apply() solution errs on it with Error in `[<-.data.frame`(`*tmp*`, -1, value = c(4, 5, 6)) : replacement has 1 row, data has 3"; I believe what happens is that apply() flattens a single-column output to a vector, and then the t() call turns it into a 1x3 matrix, thus not aligning with the target of the assignment, which is a 3x1 data.frame:

df <- data.frame(type = c("planes", "trains", "automobiles"), t1 = c(4, 5, 6));
josilber(df);
## Error in `[<-.data.frame`(`*tmp*`, -1, value = c(4, 5, 6)) :
##   replacement has 1 row, data has 3
bgoldst(df);
##          type t1
## 1      planes  4
## 2      trains  5
## 3 automobiles  6

Upvotes: 2

josliber
josliber

Reputation: 44340

You can do this with apply, replacing the relevant part of your data frame with the result after you call cumsum on each row.

df[-1] <- t(apply(df[-1], 1, cumsum))
df
#          type t1 t2  t3  t4
# 1      planes  4 24 124 924
# 2      trains  5 65 185 545
# 3 automobiles  6 30 102 246

Upvotes: 2

Related Questions