How to create groups based on conditions

Question

I have this kind of data:

set.seed(12345)

df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
             int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
             out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))

Here is desired output:

   group size      int      out  id  id2
1      A 1000 5.585529 0.000000   1    1
2      A 1000 5.709466 0.000000   1    1
3      A 1000 4.890697 0.000000   1    1
4      A 1000 0.000000 0.000000   1    1
5      A 1000 0.000000 0.000000   1    1
6      A    0 0.000000 4.080678   1    1
7      A    0 0.000000 4.883752   NA   1
8      A    0 0.000000 6.817312   NA   1
9      A 1000 4.546503 0.000000   2    2
10     A 1000 5.605887 0.000000   2    2
11     A 1000 3.182044 0.000000   2    2
12     A 1000 0.000000 0.000000   2    2
13     A 1000 0.000000 0.000000   2    2
14     A 1000 0.000000 0.000000   2    2
15     A 1000 0.000000 0.000000   2    2
16     A    0 0.000000 5.370628   2    2
17     A    0 0.000000 5.520216   NA   2
18     A    0 0.000000 4.249468   NA   2
19     A 1000 5.630099 0.000000   3    3 
20     A 1000 4.723816 0.000000   3    3
21     A 1000 4.715840 0.000000   3    3
22     A 1000 0.000000 0.000000   3    3
23     A 1000 0.000000 0.000000   3    3
24     A    0 0.000000 5.816900   3    3
25     A    0 0.000000 4.113642   NA   3
26     A    0 0.000000 4.668422   NA   3

The new group id is created based on the data above. I believe rle function is the way to go, but I cannot figure it out to the end.

Frank · Accepted Answer

A variation on @ycw's answer:

library(data.table)
setDT(df)

df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]

    group size      int      out  g
 1:     A 1000 5.585529 0.000000  1
 2:     A 1000 5.709466 0.000000  1
 3:     A 1000 4.890697 0.000000  1
 4:     A 1000 0.000000 0.000000  1
 5:     A 1000 0.000000 0.000000  1
 6:     A    0 0.000000 4.080678  1
 7:     A    0 0.000000 4.883752 NA
 8:     A    0 0.000000 6.817312 NA
 9:     A 2000 4.546503 0.000000  3
10:     A 2000 5.605887 0.000000  3
11:     A 2000 3.182044 0.000000  3
12:     A 2000 0.000000 0.000000  3
13:     A 2000 0.000000 0.000000  3
14:     A 2000 0.000000 0.000000  3
15:     A 2000 0.000000 0.000000  3
16:     A    0 0.000000 5.370628  3
17:     A    0 0.000000 5.520216 NA
18:     A    0 0.000000 4.249468 NA
19:     A 5000 5.630099 0.000000  5
20:     A 5000 4.723816 0.000000  5
21:     A 5000 4.715840 0.000000  5
22:     A 5000 0.000000 0.000000  5
23:     A 5000 0.000000 0.000000  5
24:     A    0 0.000000 5.816900  5
25:     A    0 0.000000 4.113642 NA
26:     A    0 0.000000 4.668422 NA
    group size      int      out  g

(@ycw suggested I make it a separate answer. Also, the NA^x trick is borrowed from @akrun.)

For the OP's group numbers, this extra step works:

df[, g := match(g, unique(na.omit(g)))]

For the extension the OP added ("id2"):

w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]

So in the end we have...

    group size      int      out  g g2
 1:     A 1000 5.585529 0.000000  1  1
 2:     A 1000 5.709466 0.000000  1  1
 3:     A 1000 4.890697 0.000000  1  1
 4:     A 1000 0.000000 0.000000  1  1
 5:     A 1000 0.000000 0.000000  1  1
 6:     A    0 0.000000 4.080678  1  1
 7:     A    0 0.000000 4.883752 NA  1
 8:     A    0 0.000000 6.817312 NA  1
 9:     A 2000 4.546503 0.000000  2  2
10:     A 2000 5.605887 0.000000  2  2
11:     A 2000 3.182044 0.000000  2  2
12:     A 2000 0.000000 0.000000  2  2
13:     A 2000 0.000000 0.000000  2  2
14:     A 2000 0.000000 0.000000  2  2
15:     A 2000 0.000000 0.000000  2  2
16:     A    0 0.000000 5.370628  2  2
17:     A    0 0.000000 5.520216 NA  2
18:     A    0 0.000000 4.249468 NA  2
19:     A 5000 5.630099 0.000000  3  3
20:     A 5000 4.723816 0.000000  3  3
21:     A 5000 4.715840 0.000000  3  3
22:     A 5000 0.000000 0.000000  3  3
23:     A 5000 0.000000 0.000000  3  3
24:     A    0 0.000000 5.816900  3  3
25:     A    0 0.000000 4.113642 NA  3
26:     A    0 0.000000 4.668422 NA  3
    group size      int      out  g g2

For base R analogues, there is an SO Q&A on how to make rleid without data.table; shift can be constructed manually (it's just a lag operator); and there are other ways to find w (maybe tapply?).

How to create groups based on conditions

Answers (2)

Related Questions