Roshini
Roshini

Reputation: 723

cumsum using ddply

I need to use group by in levels with ddply or aggregate if that's easier. I am not really sure how to do this as I need to use cumsum as my aggregate function. This is what my data looks like:

level1      level2  hour     product 
A           tea     0          7
A           tea     1          2
A           tea     2          9
A           coffee  17         7
A           coffee  18         2
A           coffee  20         4
B           coffee  0          2
B           coffee  1          3
B           coffee  2          4
B           tea     21         3
B           tea     22         1

expected output:

A     tea     0   7
A     tea     1   9
A     tea     2   18
A     coffee  17  7
A     coffee  18  9
A     coffee  20  13
B     coffee  0   2
B     coffee  1   5
B     coffee  2   9
B     tea     21  3
B     tea     22  4

I tried using

ddply(dd,c("level1","level2","hour"),summarise,cumsum(product))

but that doesn't sum up which I think is because the hour column is being used for group by and its being split by that..I think.. I am not sure I completely understand how aggregate works here. Is there any way I could get the required output using aggregate or ddply?

Upvotes: 13

Views: 5219

Answers (3)

Tanner33
Tanner33

Reputation: 140

Folks referencing this now may wish to avoid using plyr::ddply. Here is a dplyr solution.

library(dplyr)
output <- dd %>%
  group_by(level1, level2) %>%
  mutate(product_sum = cumsum(product)) %>%
  ungroup()
output

Note that product_sum is where the new column is named. You could use product instead. Then, the original product column would be overwritten as it appears the original question may have preferred to do that and not have both columns in the output.

Data:

level1 <- c(rep("A",6), rep("B",5))
level2 <- c(rep("tea",3), rep("coffee",6), rep("tea",2))
hour <- c(0,1,2,17,18,20,0,1,2,21,22)
product <- c(7,2,9,7,2,4,2,3,4,3,1)

dd <- data.frame(level1, level2, hour, product)

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193667

Here's a solution in base R using ave and within:

within(mydf, {
  cumsumProduct <- ave(product, level1, level2, FUN = cumsum)
})
#    level1 level2 hour product cumsumProduct
# 1       A    tea    0       7             7
# 2       A    tea    1       2             9
# 3       A    tea    2       9            18
# 4       A coffee   17       7             7
# 5       A coffee   18       2             9
# 6       A coffee   20       4            13
# 7       B coffee    0       2             2
# 8       B coffee    1       3             5
# 9       B coffee    2       4             9
# 10      B    tea   21       3             3
# 11      B    tea   22       1             4

Of course, if you wanted to drop the existing product column, you can change the command to the following to overwrite the current "product" column:

within(mydf, {
  product <- ave(product, level1, level2, FUN = cumsum)
})

Your current approach doesn't work in part because you've included "hour" as one of your grouping variables. In other words, it is seeing the combination of "A + tea + 0" as different from "A + tea + 1", but from your desired output, you seem to simply want the combination of "A + tea" to be the group.

aggregate won't work as you expect, because it will condense everything into a data.frame with the same number of rows as the number of unique combinations of "level1" and "level2", in this case, 4 rows. The aggregated column would be a list. The values would be correct, but it would be less useful.

Here's aggregate and its output:

> aggregate(product ~ level1 + level2, mydf, cumsum)
  level1 level2  product
1      A coffee 7, 9, 13
2      B coffee  2, 5, 9
3      A    tea 7, 9, 18
4      B    tea     3, 4

Upvotes: 16

Arun
Arun

Reputation: 118869

you should use transform instead of summarise:

# you should probably order your `level2` first
dd$level2 <- factor(dd$level2, levels=c("tea", "coffee"))
# and transform using level1 and level2 alone, not hour
# if you use hour, the groups will be for each row
ddply(dd, .(level1, level2), transform, product=cumsum(product))

#    level1 level2 hour product
# 1       A    tea    0       7
# 2       A    tea    1       9
# 3       A    tea    2      18
# 4       A coffee   17       7
# 5       A coffee   18       9
# 6       A coffee   20      13
# 7       B    tea   21       3
# 8       B    tea   22       4
# 9       B coffee    0       2
# 10      B coffee    1       5
# 11      B coffee    2       9

Upvotes: 7

Related Questions