Tim Welsh
Tim Welsh

Reputation: 15

Create new count variable conditional on previous values within group

I have a dataframe with group and time-id. Now I want to create a new count variable, called X2, conditional on previous values of X1 within each group.

Suppose I have the following dataframe, variables group, time, X1, and want to create X2. The value of X2 should be a count variable indicating the number of periods (i.e., rows) since X1 has last been equal to 1 within the given group. If all previous values of X1 are 0, X2 should be missing.

          group     time       X1        X2 
1          1         1         0         NA  
2          1         2         0         NA  
3          1         3         1         NA  
4          1         4         0         0  
5          1         5         1         1
6          2         1         0         NA  
7          2         2         1         NA  
8          2         3         1         0  
9          2         4         1         0  
10         2         5         0         0  
11         3         1         0         NA  
12         3         2         0         NA  
13         3         3         0         NA  
14         3         4         1         NA  
15         3         5         0         0  
16         4         1         1         NA  
17         4         2         0         0  
18         4         3         0         1  
19         4         4         0         2  
20         4         5         1         3    
data_test <- data.frame(
   group = rep(1:4, each = 5), 
   time = rep(1:5, n = 4), 
   X1 = c(0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1), 
   X2 = c(NA, NA, NA, 0, 1, NA, NA, 0, 0, 0, NA, NA, NA, NA, 0, NA, 0, 1, 2, 3))

I could easily create the NAs by using an if command with cumsum()==0. However, I have no idea how to approach the other part, i.e., create the count.

Any help would be much appreciated - if at all possible using dplyr.

Thanks a lot.

Upvotes: 0

Views: 267

Answers (2)

Ben
Ben

Reputation: 30474

This isn't very elegant, but it incorporates the rules provided I think.

A column is added for the row numbers where X1 is 1, and fill used so you can use the most recent row number available for that condition.

To calculate X2_new, if X1 is 0, it is the difference between the current row number within the group minus the last row number where X1 was 1. If X1 is 1 (and not the first X1 of 1 in the group), then do similar calculation but on the previous rn reference.

library(tidyverse)

data_test %>%
  group_by(group) %>%
  mutate(rn = ifelse(X1 == 1, row_number(), NA)) %>%
  fill(rn) %>%
  mutate(X2_new = ifelse(X1 == 0 & row_number() > rn, row_number() - rn - 1, NA),
         X2_new = ifelse(X1 == 1 & !is.na(lag(rn)), row_number() - lag(rn) - 1, X2_new)) %>%
  select(-rn)

Output

# A tibble: 20 x 5
# Groups:   group [4]
   group  time    X1    X2 X2_new
   <int> <int> <dbl> <dbl>  <dbl>
 1     1     1     0    NA     NA
 2     1     2     0    NA     NA
 3     1     3     1    NA     NA
 4     1     4     0     0      0
 5     1     5     1     1      1
 6     2     1     0    NA     NA
 7     2     2     1    NA     NA
 8     2     3     1     0      0
 9     2     4     1     0      0
10     2     5     0     0      0
11     3     1     0    NA     NA
12     3     2     0    NA     NA
13     3     3     0    NA     NA
14     3     4     1    NA     NA
15     3     5     0     0      0
16     4     1     1    NA     NA
17     4     2     0     0      0
18     4     3     0     1      1
19     4     4     0     2      2
20     4     5     1     3      3

Upvotes: 1

bbakes
bbakes

Reputation: 1

Using the pipe (%>%), mutate, group_by, and lag functions from the dplyr package

library(dplyr)
data_test %>%
  group_by(group) %>%
  mutate(X2 = ifelse(lag(X1) == 1, 0, lag(X2) + 1))

This does assume the difference in time will always be 1 as shown in your sample data. If time varies in your data and X2 should reflect the difference in time:

data_test %>%
  group_by(group) %>%
  mutate(X2 = ifelse(lag(X1) == 1, 0, lag(X2) + (time - lag(time))))

Upvotes: 0

Related Questions