Reputation: 433
I have the following data frame:
df1 <- data.frame(id = rep(1:3, each = 5),
time = rep(1:5),
y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))
df1
## id time y
## 1 1 1 1
## 2 1 2 1
## 3 1 3 1
## 4 1 4 1
## 5 1 5 0
## 6 2 1 1
## 7 2 2 0
## 8 2 3 1
## 9 2 4 1
## 10 2 5 0
## 11 3 1 0
## 12 3 2 1
## 13 3 3 0
## 14 3 4 0
## 15 3 5 0
I'd like to create a new indicator variable that tells me, for each of the three ids, at what point y = 0 for all subsequent responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.
I'm getting tripped up on id 2, where y = 0 at time point 2 but then goes back to 1. I'd like the indicator variable to take subsequent time points into account.
Essentially, I'm looking for the following output:
df1
## id time y new_col
## 1 1 1 1 0
## 2 1 2 1 0
## 3 1 3 1 0
## 4 1 4 1 0
## 5 1 5 0 1
## 6 2 1 1 0
## 7 2 2 0 0
## 8 2 3 1 0
## 9 2 4 1 0
## 10 2 5 0 1
## 11 3 1 0 0
## 12 3 2 1 0
## 13 3 3 0 1
## 14 3 4 0 1
## 15 3 5 0 1
The new_col variable indicates whether y = 0 at that time point and at all subsequent time points within the same id.
Upvotes: 0
Views: 108
Reputation: 886938
Here is an option using data.table:
library(data.table)
setDT(df1)[, indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
# id time y indicator
# 1: 1 1 1 0
# 2: 1 2 1 0
# 3: 1 3 1 0
# 4: 1 4 1 0
# 5: 1 5 0 1
# 6: 2 1 1 0
# 7: 2 2 0 0
# 8: 2 3 1 0
# 9: 2 4 1 0
#10: 2 5 0 1
#11: 3 1 0 0
#12: 3 2 1 0
#13: 3 3 0 1
#14: 3 4 0 1
#15: 3 5 0 1
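To see what the one-liner does inside a single group, here is a rough step-by-step sketch for the y values of id 2. It uses plain vectors rather than a data.table, and run_id is a hypothetical base-R stand-in for data.table's rleid() that I wrote only for this illustration.

```r
# Hypothetical base-R stand-in for data.table::rleid(): numbers each run of equal values
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

y <- c(1, 0, 1, 1, 0)                 # y values for id 2

runs  <- run_id(y)                    # 1 2 3 3 4
score <- runs * (y == 0)              # 0 2 0 0 4 (run ids survive only where y is 0)
start <- which.max(score)             # 5, the first row of the last run of zeros

# Flag the start row, then carry the flag forward with cumsum()
indicator <- cumsum(seq_along(y) %in% start)
indicator                             # 0 0 0 0 1
```

Because which.max() returns the first position of the maximum, a longer final run of zeros (as in id 3) is flagged from its first row onward.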
Based on the comments from @docendodiscimus, if 'y' does not end in 0 for some 'id', we can set the indicator to 0 for that whole group:
setDT(df1)[, indicator := {
  i1 <- rleid(y) * !y
  if (i1[.N] != max(i1) & !is.na(i1[.N])) 0L
  else cumsum(.I %in% .I[which.max(i1)])
}, id]
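The effect of the extra check is easiest to see on small vectors. Below is a sketch that mirrors the same logic outside data.table; guarded and run_id are hypothetical helper names of mine (run_id imitates rleid() in base R), so treat this as an illustration of the idea rather than the exact grouped code.

```r
# Hypothetical base-R stand-in for data.table::rleid()
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# Mirrors the guarded expression for one group's y values
guarded <- function(y) {
  i1 <- run_id(y) * (y == 0)
  n  <- length(y)
  if (i1[n] != max(i1) && !is.na(i1[n])) {
    rep(0L, n)                          # group does not end in a run of zeros
  } else {
    cumsum(seq_along(y) %in% which.max(i1))
  }
}

guarded(c(0, 0, 1))   # 0 0 0  (ends in 1, so nothing is flagged)
guarded(c(1, 0, 0))   # 0 1 1  (ends in zeros, flagged from the start of that run)
guarded(c(1, 1, 1))   # 1 1 1  (no zeros at all: the check does not catch this case)
```

As far as I can tell, the last call shows a remaining edge case: when a group contains no zeros, max(i1) is 0 and equals the last element, so the guard is skipped. If such groups can occur in your data, an additional test on the last y value may be needed.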
Upvotes: 0
Reputation: 70246
I would use a little helper function for that.
foo <- function(x, val) {
  pos <- max(which(x != val)) + 1
  as.integer(seq_along(x) >= pos)
}
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(indicator = foo(y, 0))
# # A tibble: 15 x 4
# # Groups: id [3]
# id time y indicator
# <int> <int> <dbl> <int>
# 1 1 1 1 0
# 2 1 2 1 0
# 3 1 3 1 0
# 4 1 4 1 0
# 5 1 5 0 1
# 6 2 1 1 0
# 7 2 2 0 0
# 8 2 3 1 0
# 9 2 4 1 0
# 10 2 5 0 1
# 11 3 1 0 0
# 12 3 2 1 0
# 13 3 3 0 1
# 14 3 4 0 1
# 15 3 5 0 1
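One edge case worth knowing about: if a group consists only of zeros, which() returns an empty vector and max() then returns -Inf with a warning. The result is still all 1s, but the warning can be avoided. Here is a minimal sketch of a variant, under the assumption that an all-zero group should be flagged from its first row; foo2 is a hypothetical name, not part of the answer above.

```r
# Variant of the helper: the 0L fallback keeps max() from seeing an empty vector
foo2 <- function(x, val) {
  pos <- max(0L, which(x != val)) + 1   # position right after the last non-val entry
  as.integer(seq_along(x) >= pos)
}

foo2(c(1, 0, 1, 1, 0), 0)   # 0 0 0 0 1  (same as foo for id 2)
foo2(c(0, 0, 0), 0)         # 1 1 1      (all zeros, no warning)
foo2(c(0, 0, 1), 0)         # 0 0 0      (ends in a non-zero value)
```

It can be dropped into the same group_by() and mutate() pipeline in place of foo.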
If y can contain NA values and you want them to count as non-zero, you can adjust foo to:
foo <- function(x, val) {
  pos <- max(which(x != val | is.na(x))) + 1
  as.integer(seq_along(x) >= pos)
}
That way, an NA is treated like a non-zero value: if there's an NA after the last y = 0, the indicator will remain 0.
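To make the difference concrete, here is a small side-by-side sketch of the two versions on made-up vectors containing NA. The names foo_plain and foo_na are mine, used only so both definitions can coexist in one session.

```r
# First version: which() silently drops the NA produced by NA != val
foo_plain <- function(x, val) {
  pos <- max(which(x != val)) + 1
  as.integer(seq_along(x) >= pos)
}

# NA-aware version: an NA counts as "not val"
foo_na <- function(x, val) {
  pos <- max(which(x != val | is.na(x))) + 1
  as.integer(seq_along(x) >= pos)
}

y1 <- c(1, 0, NA, 0, 0)   # NA in the middle of the trailing zeros
foo_plain(y1, 0)          # 0 1 1 1 1  (the NA is ignored)
foo_na(y1, 0)             # 0 0 0 1 1  (flagging starts after the NA)

y2 <- c(1, 0, 0, NA)      # NA at the very end
foo_plain(y2, 0)          # 0 1 1 1
foo_na(y2, 0)             # 0 0 0 0
```

Which behaviour is right depends on whether a missing response should break a run of zeros in your data.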
Upvotes: 2