afleishman

Reputation: 433

Creating a new variable while using subsequent values in R

I have the following data frame:

df1 <- data.frame(id = rep(1:3, each = 5), 
                  time = rep(1:5),
                  y = c(rep(1, 4), 0, 1, 0, 1, 1, 0, 0, 1, rep(0,3)))

df1
##    id time y
## 1   1    1 1
## 2   1    2 1
## 3   1    3 1
## 4   1    4 1
## 5   1    5 0
## 6   2    1 1
## 7   2    2 0
## 8   2    3 1
## 9   2    4 1
## 10  2    5 0
## 11  3    1 0
## 12  3    2 1
## 13  3    3 0
## 14  3    4 0
## 15  3    5 0

I'd like to create a new indicator variable that tells me, for each of the three ids, the point from which y = 0 for all remaining responses. In the example above, for ids 1 and 2 this occurs at the 5th time point, and for id 3 this occurs at the 3rd time point.

I'm getting tripped up on id 2, where y = 0 at time point 2, but then goes back to one. I'd like the indicator variable to take subsequent time points into account.

Essentially, I'm looking for the following output:

df1
##    id time y new_col
## 1   1    1 1       0
## 2   1    2 1       0
## 3   1    3 1       0
## 4   1    4 1       0
## 5   1    5 0       1
## 6   2    1 1       0
## 7   2    2 0       0
## 8   2    3 1       0
## 9   2    4 1       0
## 10  2    5 0       1
## 11  3    1 0       0
## 12  3    2 1       0
## 13  3    3 0       1
## 14  3    4 0       1
## 15  3    5 0       1

The new_col variable indicates whether y = 0 at that time point and at all subsequent time points.

Upvotes: 0

Views: 108

Answers (2)

akrun

Reputation: 886938

Here is an option using data.table:

library(data.table)
setDT(df1)[,  indicator := cumsum(.I %in% .I[which.max(rleid(y)*!y)]), id]
df1
#    id time y indicator
# 1:  1    1 1         0
# 2:  1    2 1         0
# 3:  1    3 1         0
# 4:  1    4 1         0
# 5:  1    5 0         1
# 6:  2    1 1         0
# 7:  2    2 0         0
# 8:  2    3 1         0
# 9:  2    4 1         0
#10:  2    5 0         1
#11:  3    1 0         0
#12:  3    2 1         0
#13:  3    3 0         1
#14:  3    4 0         1
#15:  3    5 0         1
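To see what the one-liner does, here is a minimal sketch that walks through the steps for id 2 alone. To keep it free of package dependencies, `run_id` is a hypothetical base R stand-in for data.table's `rleid()`, and `seq_along(y)` plays the role of `.I` within a single group.

```r
# Hypothetical base R stand-in for data.table::rleid():
# a counter that increases every time the value changes.
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

y <- c(1, 0, 1, 1, 0)                 # the y values for id 2

runs      <- run_id(y)                # 1 2 3 3 4
zero_runs <- runs * !y                # run id where y == 0, else 0: 0 2 0 0 4
start     <- which.max(zero_runs)     # first position of the last zero run: 5

# Flag the start position, then carry the flag forward with cumsum().
indicator <- cumsum(seq_along(y) == start)
indicator
# [1] 0 0 0 0 1
```

Because `which.max()` returns the first position of the maximum, a longer trailing run of zeros (as in id 3) is flagged from its first element onward.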

Based on the comments from @docendodiscimus: if 'y' does not end in 0 for some 'id', the indicator should stay 0 for that whole group, which we can handle with

setDT(df1)[, indicator := {
  i1 <- rleid(y) * !y
  if (i1[.N] != max(i1) & !is.na(i1[.N])) 0L
  else cumsum(.I %in% .I[which.max(i1)])
}, id]
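As a rough illustration of why the extra check matters, the sketch below applies the same logic to a single vector in base R. The names `run_id` and `flag_trailing_zeros` are hypothetical helpers, not part of data.table, and the sketch assumes `y` contains no NA values, so the `is.na()` guard is left out.

```r
# Hypothetical base R stand-in for data.table::rleid().
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

# Assumes y has no NA values.
flag_trailing_zeros <- function(y) {
  i1 <- run_id(y) * !y
  n  <- length(y)
  if (i1[n] != max(i1)) {
    # The last run is not a zero run, so nothing is flagged.
    rep(0L, n)
  } else {
    cumsum(seq_along(y) == which.max(i1))
  }
}

flag_trailing_zeros(c(1, 0, 1, 1, 0))   # ends in 0
# [1] 0 0 0 0 1
flag_trailing_zeros(c(0, 0, 1, 0, 1))   # ends in 1
# [1] 0 0 0 0 0
```

Without the check, the second vector would be flagged from position 4 onward, even though y returns to 1 at the last time point.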

Upvotes: 0

talat

Reputation: 70246

I would use a little helper function for that.

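foo <- function(x, val) {
  # first position after the last element that differs from val
  pos <- max(which(x != val)) + 1
  # 1 from that position onward, 0 before it
  as.integer(seq_along(x) >= pos)
}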

library(dplyr)  # provides %>%, group_by() and mutate()

df1 %>% 
  group_by(id) %>% 
  mutate(indicator = foo(y, 0))

# # A tibble: 15 x 4
# # Groups:   id [3]
#       id  time     y indicator
#    <int> <int> <dbl>     <int>
#  1     1     1     1         0
#  2     1     2     1         0
#  3     1     3     1         0
#  4     1     4     1         0
#  5     1     5     0         1
#  6     2     1     1         0
#  7     2     2     0         0
#  8     2     3     1         0
#  9     2     4     1         0
# 10     2     5     0         1
# 11     3     1     0         0
# 12     3     2     1         0
# 13     3     3     0         1
# 14     3     4     0         1
# 15     3     5     0         1
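For anyone wondering how the helper arrives at this, here is a small sketch that traces it on the y values of id 2, plus a made-up vector that does not end in 0 (that second vector is an assumption for illustration and is not in the question's data).

```r
foo <- function(x, val) {
  pos <- max(which(x != val)) + 1
  as.integer(seq_along(x) >= pos)
}

y <- c(1, 0, 1, 1, 0)        # the y values for id 2

which(y != 0)                # positions where y is not 0
# [1] 1 3 4
max(which(y != 0)) + 1       # first position after the last non-zero value
# [1] 5
foo(y, 0)
# [1] 0 0 0 0 1

# If the vector ends in a non-zero value, pos falls past the end,
# so no position is flagged.
foo(c(0, 0, 1, 0, 1), 0)
# [1] 0 0 0 0 0
```

So the case where y does not end in 0 needs no special handling here.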

In case you want to consider NA-values in y, you can adjust foo to:

foo <- function(x, val) {
  pos <- max(which(x != val | is.na(x))) +1
  as.integer(seq_along(x) >= pos)
}

That way, an NA is treated like a non-zero value: if there's an NA after the last y = 0, the indicator will remain 0.
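A quick comparison on made-up vectors (not from the question's data) shows the difference. The two versions are named `foo_plain` and `foo_na` here only so they can sit side by side; those names are assumptions for this sketch.

```r
# Original helper: which() silently drops positions where x is NA.
foo_plain <- function(x, val) {
  pos <- max(which(x != val)) + 1
  as.integer(seq_along(x) >= pos)
}

# Adjusted helper: NA counts as "not val".
foo_na <- function(x, val) {
  pos <- max(which(x != val | is.na(x))) + 1
  as.integer(seq_along(x) >= pos)
}

y <- c(1, 0, NA, 0, 0)

foo_plain(y, 0)   # the NA at position 3 is ignored
# [1] 0 1 1 1 1
foo_na(y, 0)      # flagging starts only after the NA
# [1] 0 0 0 1 1

foo_na(c(1, 0, 0, NA), 0)   # NA after the last 0: nothing is flagged
# [1] 0 0 0 0
```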

Upvotes: 2
