Reputation: 4229
I have this kind of data:
set.seed(12345)
df <- data.frame(group=rep(c("A"),26), size=c(rep(1000,5),rep(0,3),rep(1000,7),rep(0,3),rep(1000,5),rep(0,3)),
int=c(rnorm(3,5,1),rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5)),
out=c(rep(0,5),rnorm(3,5,1),rep(0,7),rnorm(3,5,1),rep(0,5),rnorm(3,5,1)))
Here is desired output:
group size int out id id2
1 A 1000 5.585529 0.000000 1 1
2 A 1000 5.709466 0.000000 1 1
3 A 1000 4.890697 0.000000 1 1
4 A 1000 0.000000 0.000000 1 1
5 A 1000 0.000000 0.000000 1 1
6 A 0 0.000000 4.080678 1 1
7 A 0 0.000000 4.883752 NA 1
8 A 0 0.000000 6.817312 NA 1
9 A 1000 4.546503 0.000000 2 2
10 A 1000 5.605887 0.000000 2 2
11 A 1000 3.182044 0.000000 2 2
12 A 1000 0.000000 0.000000 2 2
13 A 1000 0.000000 0.000000 2 2
14 A 1000 0.000000 0.000000 2 2
15 A 1000 0.000000 0.000000 2 2
16 A 0 0.000000 5.370628 2 2
17 A 0 0.000000 5.520216 NA 2
18 A 0 0.000000 4.249468 NA 2
19 A 1000 5.630099 0.000000 3 3
20 A 1000 4.723816 0.000000 3 3
21 A 1000 4.715840 0.000000 3 3
22 A 1000 0.000000 0.000000 3 3
23 A 1000 0.000000 0.000000 3 3
24 A 0 0.000000 5.816900 3 3
25 A 0 0.000000 4.113642 NA 3
26 A 0 0.000000 4.668422 NA 3
The new group id
is created based on the data above. I believe rle
function is the way to go, but I cannot figure it out to the end.
Upvotes: 0
Views: 66
Reputation: 66819
A variation on @ycw's answer:
library(data.table)
setDT(df)
df[, g := rleid( z <- out==0 | shift(out==0) )*NA^(!z) ]
group size int out g
1: A 1000 5.585529 0.000000 1
2: A 1000 5.709466 0.000000 1
3: A 1000 4.890697 0.000000 1
4: A 1000 0.000000 0.000000 1
5: A 1000 0.000000 0.000000 1
6: A 0 0.000000 4.080678 1
7: A 0 0.000000 4.883752 NA
8: A 0 0.000000 6.817312 NA
9: A 2000 4.546503 0.000000 3
10: A 2000 5.605887 0.000000 3
11: A 2000 3.182044 0.000000 3
12: A 2000 0.000000 0.000000 3
13: A 2000 0.000000 0.000000 3
14: A 2000 0.000000 0.000000 3
15: A 2000 0.000000 0.000000 3
16: A 0 0.000000 5.370628 3
17: A 0 0.000000 5.520216 NA
18: A 0 0.000000 4.249468 NA
19: A 5000 5.630099 0.000000 5
20: A 5000 4.723816 0.000000 5
21: A 5000 4.715840 0.000000 5
22: A 5000 0.000000 0.000000 5
23: A 5000 0.000000 0.000000 5
24: A 0 0.000000 5.816900 5
25: A 0 0.000000 4.113642 NA
26: A 0 0.000000 4.668422 NA
group size int out g
(@ycw suggested I make it a separate answer. Also, the NA^x
trick is borrowed from @akrun.)
For the OP's group numbers, this extra step works:
df[, g := match(g, unique(na.omit(g)))]
For the extension the OP added ("id2"):
w = df[.(unique(na.omit(g))), on=.(g), which=TRUE, mult="first"]
df[, g2 := cumsum(.I %in% w)]
So in the end we have...
group size int out g g2
1: A 1000 5.585529 0.000000 1 1
2: A 1000 5.709466 0.000000 1 1
3: A 1000 4.890697 0.000000 1 1
4: A 1000 0.000000 0.000000 1 1
5: A 1000 0.000000 0.000000 1 1
6: A 0 0.000000 4.080678 1 1
7: A 0 0.000000 4.883752 NA 1
8: A 0 0.000000 6.817312 NA 1
9: A 2000 4.546503 0.000000 2 2
10: A 2000 5.605887 0.000000 2 2
11: A 2000 3.182044 0.000000 2 2
12: A 2000 0.000000 0.000000 2 2
13: A 2000 0.000000 0.000000 2 2
14: A 2000 0.000000 0.000000 2 2
15: A 2000 0.000000 0.000000 2 2
16: A 0 0.000000 5.370628 2 2
17: A 0 0.000000 5.520216 NA 2
18: A 0 0.000000 4.249468 NA 2
19: A 5000 5.630099 0.000000 3 3
20: A 5000 4.723816 0.000000 3 3
21: A 5000 4.715840 0.000000 3 3
22: A 5000 0.000000 0.000000 3 3
23: A 5000 0.000000 0.000000 3 3
24: A 0 0.000000 5.816900 3 3
25: A 0 0.000000 4.113642 NA 3
26: A 0 0.000000 4.668422 NA 3
group size int out g g2
For base R analogues, there is an SO Q&A on how to make rleid
without data.table; shift
can be constructed manually (it's just a lag operator); and there are other ways to find w
(maybe tapply
?).
Upvotes: 3
Reputation: 39174
Here is an option using dplyr
and the rleid
function from the data.table
package. dt2
is the final output.
library(dplyr)
library(data.table)
df2 <- df %>%
mutate(non_zero = ifelse(size != 0, 1, 0)) %>%
mutate(runID = rleid(non_zero)) %>%
mutate(runID = ifelse(runID %% 2 != 0, (runID + 1)/2, runID/2)) %>%
group_by(runID) %>%
mutate(id = ifelse(row_number() %in% n():(n() - 1), NA, runID)) %>%
ungroup() %>%
select(group, size, int, out, id, id2 = runID)
Upvotes: 2