Reputation: 1223
I have the following data frame:
df <- data.frame(type = c("planes", "trains", "automobiles"), t1 = c(4, 5, 6), t2 = c(20, 60, 24), t3 = c(100, 120, 72), t4 = c(800, 360, 144))
df
type t1 t2 t3 t4
1 planes 4 20 100 800
2 trains 5 60 120 360
3 automobiles 6 24 72 144
I'd now like to write a function that transform columns 3:5 to be running aggregations across each row (e.g., t2 = t1+t2 and t3 = t1+t2+t3), so that my new data frame looks like this:
new_df
type t1 t2 t3 t4
1 planes 4 24 124 924
2 trains 5 65 185 545
3 automobiles 6 30 102 246
Upvotes: 2
Views: 71
Reputation: 44688
Just to add to @bgoldst even faster for the many column small amount of rows situation would be a for loop:
brandon <- function(df) {
part <- as.matrix(df[2:ncol(df)])
for(x in 1:nrow(df)) {
df[x,2:ncol(df)] <- cumsum((part[x,]))
}
return(df)
}
set.seed(1); R <- 4; C <- 1e4; df <- cbind(data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T)),matrix(sample(10,R*C,replace=T),R));
M <- 3; system.time({ replicate(M,josilber(df)); });
# user system elapsed
# 10.96 0.00 10.95
M <- 3; system.time({ replicate(M,bgoldst(df)); });
# user system elapsed
# 14.056 0.000 14.040
M <- 3; system.time({ replicate(M,brandon(df)); });
# user system elapsed
# 0.400 0.000 0.401
identical(bgoldst(df),brandon(df))
# [1] TRUE
Not at all true for the many rows few columns case.
Upvotes: 2
Reputation: 35324
Here's a bit of an odd solution using Reduce()
:
df[-1] <- Reduce(function(a,b) cbind(a,b+a[,ncol(a)]),c(list(as.matrix(df[2])),df[-1:-2]));
df;
## type t1 t2 t3 t4
## 1 planes 4 24 124 924
## 2 trains 5 65 185 545
## 3 automobiles 6 30 102 246
There's a significant performance benefit here for few-column-many-row data sets (which is more common than many-column-few-row, at least in my experience), because this solution will effectively iterate over columns rather than rows. Here's a demo with 10,000 rows, 4 columns:
set.seed(1); R <- 1e4; df <- data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T), t1=sample(10,R,replace=T), t2=sample(10,R,replace=T), t3=sample(10,R,replace=T), t4=sample(10,R,replace=T) );
bgoldst <- function(df) { df[-1] <- Reduce(function(a,b) cbind(a,b+a[,ncol(a)]),c(list(as.matrix(df[2])),df[-1:-2])); df; };
josilber <- function(df) { df[-1] <- t(apply(df[-1], 1, cumsum)); df; };
M <- 250; system.time({ replicate(M,josilber(df)); });
## user system elapsed
## 11.781 0.016 11.835
M <- 250; system.time({ replicate(M,bgoldst(df)); });
## user system elapsed
## 1.187 0.000 1.191
identical(bgoldst(df),josilber(df));
## [1] TRUE
Although to be fair, here's a performance test with a many-column-few-row data set, where @josilber's apply()
solution outperforms my own, although by a lesser margin. This uses 4 rows and 10,000 columns, to mirror my first performance test (but had to use a smaller M
, actually M=3
, because columns seem to be more CPU-demanding than rows, regardless of the solution):
set.seed(1); R <- 4; C <- 1e4; df <- cbind(data.frame(type=sample(c('planes','trains','automobiles'),R,replace=T)),matrix(sample(10,R*C,replace=T),R));
M <- 3; system.time({ replicate(M,josilber(df)); });
## user system elapsed
## 10.188 0.000 10.181
M <- 3; system.time({ replicate(M,bgoldst(df)); });
## user system elapsed
## 13.484 0.000 13.492
identical(bgoldst(df),josilber(df));
## [1] TRUE
And a minor point I just noticed, my solution correctly handles the degenerate case of just 1 addend column, whereas the apply()
solution errs on it with Error in `[<-.data.frame`(`*tmp*`, -1, value = c(4, 5, 6)) : replacement has 1 row, data has 3"
; I believe what happens is that apply()
flattens a single-column output to a vector, and then the t()
call turns it into a 1x3 matrix, thus not aligning with the target of the assignment, which is a 3x1 data.frame:
df <- data.frame(type = c("planes", "trains", "automobiles"), t1 = c(4, 5, 6));
josilber(df);
## Error in `[<-.data.frame`(`*tmp*`, -1, value = c(4, 5, 6)) :
## replacement has 1 row, data has 3
bgoldst(df);
## type t1
## 1 planes 4
## 2 trains 5
## 3 automobiles 6
Upvotes: 2
Reputation: 44340
You can do this with apply
, replacing the relevant part of your data frame with the result after you call cumsum
on each row.
df[-1] <- t(apply(df[-1], 1, cumsum))
df
# type t1 t2 t3 t4
# 1 planes 4 24 124 924
# 2 trains 5 65 185 545
# 3 automobiles 6 30 102 246
Upvotes: 2