Using dlply with unsplit

Question

I want to append a column to a data frame that holds the result of a cumulative function computed within each group. I can accomplish this with split/unsplit, like this:

> set.seed(3)
> d <- data.frame(type=sample(c('a','b'),10,replace=TRUE), val=rnorm(10))
> d
   type         val
1     a  0.03012394
2     b  0.08541773
3     a  1.11661021
4     a -1.21885742
5     b  1.26736872
6     b -0.74478160
7     a -1.13121857
8     a -0.71635849
9     b  0.25265237
10    b  0.15204571

So I use split/lapply/unsplit to get my desired result:

> d$sum <- unsplit(lapply(split(d,d$type), function(x) { cumsum(x$val)}), d$type)
> d
   type         val         sum
1     a  0.03012394  0.03012394
2     b  0.08541773  0.08541773
3     a  1.11661021  1.14673416
4     a -1.21885742 -0.07212326
5     b  1.26736872  1.35278645
6     b -0.74478160  0.60800486
7     a -1.13121857 -1.20334183
8     a -0.71635849 -1.91970032
9     b  0.25265237  0.86065723
10    b  0.15204571  1.01270293

This is the desired result, but I'd really like to use the simpler plyr syntax in this case. So I tried:

> d$sum2 <- unsplit(dlply(d, .(type), summarise, cumsum(val)), d$type)
Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': '1', '2', '3', '4', '5'

The output of dlply is almost the same as that of lapply/split, except that the dlply output carries some extra attributes, which I think unsplit will ignore, and its row.names have been re-indexed. I think the latter is what the error is complaining about.
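A small base R check seems consistent with that diagnosis. It is only a sketch: it assumes summarise hands back one-column data frames whose row names restart at 1 in each group, and it mimics that without plyr. The data are typed in from the printout above, and the names pieces, res and fixed are made up for illustration.

```r
# Data copied from the printed example above
d <- data.frame(
  type = c('a', 'b', 'a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'),
  val  = c(0.03012394, 0.08541773, 1.11661021, -1.21885742, 1.26736872,
           -0.74478160, -1.13121857, -0.71635849, 0.25265237, 0.15204571)
)

# Mimic the dlply/summarise output: one data frame per group,
# each with row names restarting at 1 (assumption, see above)
pieces <- lapply(split(d, d$type), function(x) data.frame(sum = cumsum(x$val)))

# Given data frames, unsplit tries to reassemble their row names,
# which now repeat across groups, so the call fails
res <- try(unsplit(pieces, d$type), silent = TRUE)
inherits(res, "try-error")

# Pulling the column out first gives unsplit plain vectors to work with
fixed <- unsplit(lapply(pieces, function(p) p$sum), d$type)
```

If that holds, the row names, not the extra attributes, are what unsplit trips over.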

Note that I am aware that I can approach this with ddply/transform:

> ddply(d, .(type), transform, sum2=cumsum(val))
   type         val         sum        sum2
1     a  0.03012394  0.03012394  0.03012394
2     a  1.11661021  1.14673416  1.14673416
3     a -1.21885742 -0.07212326 -0.07212326
4     a -1.13121857 -1.20334183 -1.20334183
5     a -0.71635849 -1.91970032 -1.91970032
6     b  0.08541773  0.08541773  0.08541773
7     b  1.26736872  1.35278645  1.35278645
8     b -0.74478160  0.60800486  0.60800486
9     b  0.25265237  0.86065723  0.86065723
10    b  0.15204571  1.01270293  1.01270293

This won't work in my case because, as you can see, it has the side effect of reordering the rows by group. If ddply has an argument that keeps the rows in their original order, that would be perfect for my purposes.

Henrik · Accepted Answer

Perhaps you could try dplyr instead? In contrast to ddply, it keeps the original row order.

library(dplyr)
# %>% replaces the older %.% operator, which has been removed from dplyr
d %>%
  group_by(type) %>%
  mutate(sum = cumsum(val))
# Source: local data frame [10 x 3]
# Groups: type
# 
#    type         val         sum
# 1     a  0.03012394  0.03012394
# 2     b  0.08541773  0.08541773
# 3     a  1.11661021  1.14673416
# 4     a -1.21885742 -0.07212326
# 5     b  1.26736872  1.35278645
# 6     b -0.74478160  0.60800486
# 7     a -1.13121857 -1.20334183
# 8     a -0.71635849 -1.91970032
# 9     b  0.25265237  0.86065723
# 10    b  0.15204571  1.01270293
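
If you would rather not pull in another package, base R's ave might also do what you want, since it returns the per-group results in the original row positions. A minimal sketch, with the data typed in from the question's printout:

```r
# Data copied from the printed example in the question
d <- data.frame(
  type = c('a', 'b', 'a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'),
  val  = c(0.03012394, 0.08541773, 1.11661021, -1.21885742, 1.26736872,
           -0.74478160, -1.13121857, -0.71635849, 0.25265237, 0.15204571)
)

# ave applies FUN within each level of type and puts the results
# back in the original row positions, so no rows are reordered
d$sum <- ave(d$val, d$type, FUN = cumsum)
```

This should give the same sum column as the split/lapply/unsplit version in the question.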
