exzackley

Reputation: 375

Apply a function to groups within a data.frame in R

I am trying to get the cumulative sum of a variable (v) for groups ("a" and "b") within a data frame. How can I get the result at the bottom, whose row names still match the original row numbers, into column cs of my data frame?

> library(nlme)
> g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
> v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
> cs <- rep(0,12)
> d <- data.frame(g,v,cs)

> d
   g v cs
1  a 1 0
2  b 4 0
3  a 1 0
4  b 4 0
5  a 1 0
6  b 4 0
7  a 2 0
8  b 8 0
9  a 2 0
10 b 8 0
11 a 2 0
12 b 8 0

> r=gapply(d,FUN="cumsum",form=~g, which="v")
> r

$a     
   v   
1  1   
3  2   
5  3  
7  5  
9  7  
11 9  

$b    
    v 
2   4 
4   8 
6  12 
8  20 
10 28 
12 36 

> str(r)
List of 2
 $ a:'data.frame':  6 obs. of  1 variable:
  ..$ v: num [1:6] 1 2 3 5 7 9
 $ b:'data.frame':  6 obs. of  1 variable:
  ..$ v: num [1:6] 4 8 12 20 28 36

I guess I could figure out some laborious way to get the data from those dataframes into d$cs, but there's got to be some easy tweak I'm missing.

Upvotes: 16

Views: 14893

Answers (5)

Ronak Shah

Reputation: 389235

Here are a few package-based options.

plyr is retired and has been superseded by dplyr (the .by argument requires dplyr 1.1.0 or later):

library(dplyr)

d %>% mutate(cs = cumsum(v), .by = g)

#   g v cs
#1  a 1  1
#2  b 4  4
#3  a 1  2
#4  b 4  8
#5  a 1  3
#6  b 4 12
#7  a 2  5
#8  b 8 20
#9  a 2  7
#10 b 8 28
#11 a 2  9
#12 b 8 36

For larger data, collapse is very fast, and its syntax is very similar to dplyr's.

library(collapse)

d |> fgroup_by(g) |> fmutate(cs = cumsum(v))

The same thing in data.table, which adds the column by reference:

library(data.table)

setDT(d)[, cs := cumsum(v), by = g]

Upvotes: 0

chandler

Reputation: 856

> library(nlme)
> g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
> v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
> cs <- rep(0,12)
> d <- data.frame(g,v,cs)
> d <- d[order(d$g),]
> temp <- by(d$v,d$g,cumsum)
> d$cs <- do.call("c",temp)
> d
   g v cs
1  a 1  1
3  a 1  2
5  a 1  3
7  a 2  5
9  a 2  7
11 a 2  9
2  b 4  4
4  b 4  8
6  b 4 12
8  b 8 20
10 b 8 28
12 b 8 36

Another solution using the by function, though I had to order the data by group first, so the rows come out sorted by g rather than in their original order.
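
If the original row order matters, the sorting step can probably be avoided with unsplit, which puts per-group results back into the positions they came from. A minimal sketch in base R, reusing g and v from the question (the cs column is created directly rather than pre-filled with zeros):

```r
g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
d <- data.frame(g, v)

# Split v by group and take the cumulative sum within each group,
# then let unsplit() return the pieces to their original row positions.
d$cs <- unsplit(lapply(split(d$v, d$g), cumsum), d$g)
d
```

This keeps the rows interleaved as in the question, so no call to order is needed.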

Upvotes: 0

Joshua Ulrich

Reputation: 176718

I would use ave. If you look at the source of ave, you'll see it essentially wraps Martin Morgan's solution.

R> g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
R> v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
R> d <- data.frame(g,v)
R> d$cs <- ave(v, g, FUN=cumsum)
R> d
   g v cs
1  a 1  1
2  b 4  4
3  a 1  2
4  b 4  8
5  a 1  3
6  b 4 12
7  a 2  5
8  b 8 20
9  a 2  7
10 b 8 28
11 a 2  9
12 b 8 36
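
To illustrate the point about the source of ave, here is a simplified sketch of what it does for a single grouping factor. The name ave_sketch is made up for this example, and the real ave also accepts several grouping variables and handles the no-group case.

```r
g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
v <- c(1,4,1,4,1,4,2,8,2,8,2,8)

# Hypothetical, simplified stand-in for ave() with one grouping factor:
# apply FUN within each group and write the results back in place.
ave_sketch <- function(x, g, FUN = mean) {
  split(x, g) <- lapply(split(x, g), FUN)
  x
}

ave_sketch(v, g, FUN = cumsum)
```

For this input it should agree with ave(v, g, FUN = cumsum).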

Upvotes: 10

Martin Morgan

Reputation: 46886

split<- is a pretty weird beast:

split(d$cs, d$g) <- lapply(split(d$v, d$g), cumsum)

leading to

> d
   g v cs
1  a 1  1
2  b 4  4
3  a 1  2
4  b 4  8
5  a 1  3
6  b 4 12
7  a 2  5
8  b 8 20
9  a 2  7
10 b 8 28
11 a 2  9
12 b 8 36
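
If the replacement form looks opaque, it may help to see roughly what it amounts to when written out by hand. This is only a sketch of the idea, not the actual implementation of split<-; the names idx and cs are local to the example.

```r
g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
d <- data.frame(g, v)

# Row positions belonging to each group, e.g. a -> 1, 3, 5, ...
idx <- split(seq_along(d$v), d$g)

# Fill each group's positions with that group's cumulative sum.
cs <- numeric(nrow(d))
for (i in idx) {
  cs[i] <- cumsum(d$v[i])
}
d$cs <- cs
d
```

Because the values are written back by position, the row order of d is left untouched.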

Upvotes: 13

joran

Reputation: 173697

My tool of choice for these things is the plyr package:

> require(plyr)
> ddply(d,.(g),transform,cs = cumsum(v))
   g v cs
1  a 1  1
2  a 1  2
3  a 1  3
4  a 2  5
5  a 2  7
6  a 2  9
7  b 4  4
8  b 4  8
9  b 4 12
10 b 8 20
11 b 8 28
12 b 8 36

Upvotes: 7
