Josh Cho
Josh Cho

Reputation: 159

How to generate lag variable by year with repeated pid occurrences

I have panel data that looks like below.

pid   year type  occurrence 
10001 2012 A     1
10001 2012 A     2
10001 2013 B     5
10001 2013 B     3
10001 2013 B     2
10001 2014 A     1
10001 2014 A     5
10001 2014 A     3
10002 2012 B     2
10002 2012 B     1
10002 2012 B     6
10002 2013 B     3
10002 2013 B     2
10002 2013 B     7
10003 2012 B     2
10003 2012 B     1
10003 2012 B     6
10003 2014 B     3
10003 2014 B     2
10003 2015 B     2
10003 2015 B     5
10003 2015 B     2
10003 2016 B     3
10003 2016 B     6

I would like to create a variable that indicates whether for each pid, the type has been changed throughout the years such that

pid   year type  occurrence typechange
10001 2012 A     1          NA
10001 2012 A     2          NA
10001 2013 B     5          "A->B"
10001 2013 B     3          "A->B"
10001 2013 B     2          "A->B"
10001 2014 A     1          "B->A"
10001 2014 A     5          "B->A"
10001 2014 A     3          "B->A"

I tried loop through all pids and years to keep track of changes, but quickly realized that the code runs too slow for my little laptop. (dataset about 200,000 entries)

I decided to create lag variable for type a type.lag1, then compare two to create the typechange variable, but confronted a wall.

The expected lag variable should look like this:

pid   year type  occurrence type.lag1 
10001 2012 A     1          NA
10001 2012 A     2          NA
10001 2013 B     5          A
10001 2013 B     3          A
10001 2013 B     2          A
10001 2014 A     1          B
10001 2014 A     5          B
10001 2014 A     3          B
10002 2012 B     2          NA
10002 2012 B     1          NA
10002 2012 B     6          NA
10002 2013 B     3          B
10002 2013 B     2          B
10002 2013 B     7          B
10003 2012 B     2          NA
10003 2012 B     1          NA
10003 2012 B     6          NA
10003 2014 A     3          B
10003 2014 A     2          B
10003 2015 B     2          A
10003 2015 B     5          A
10003 2015 B     2          A
10003 2016 C     3          B
10003 2016 C     6          B

The lagged variable should be such that for each pid, if type is A in 2012 , then in the following year 2013 (if 2013 is missing, then in 2014 or forth), lag should have type A for all pid.

Many of the time, 3 pid appear in each year, but it is not guaranteed. There are cases where some of years are missing too. Then the lag should get values from previous(may not be consecutive) year.

If there are some method other than creating a lag variable to create typechange, it would be very helpful to hear.

Thank you very much in advance!

Upvotes: 0

Views: 138

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388907

Here is one way using lag from dplyr and tidyr::fill

library(dplyr)
df %>%
  arrange(pid, year) %>%
  group_by(pid) %>%
  mutate(type.lag1 = replace(lag(type), year == lag(year, default = first(year)), 
                             NA)) %>%
  tidyr::fill(type.lag1) 

#     pid year type occurrence type.lag1
#1  10001 2012    A          1      <NA>
#2  10001 2012    A          2      <NA>
#3  10001 2013    B          5         A
#4  10001 2013    B          3         A
#5  10001 2013    B          2         A
#6  10001 2014    A          1         B
#7  10001 2014    A          5         B
#8  10001 2014    A          3         B
#9  10002 2012    B          2      <NA>
#10 10002 2012    B          1      <NA>
#11 10002 2012    B          6      <NA>
#12 10002 2013    B          3         B
#13 10002 2013    B          2         B
#14 10002 2013    B          7         B
#15 10003 2012    B          2      <NA>
#16 10003 2012    B          1      <NA>
#17 10003 2012    B          6      <NA>
#18 10003 2014    A          3         B
#19 10003 2014    A          2         B
#20 10003 2015    B          2         A
#21 10003 2015    B          5         A
#22 10003 2015    B          2         A
#23 10003 2016    C          3         B
#24 10003 2016    C          6         B

data

df <- structure(list(pid = c(10001L, 10001L, 10001L, 10001L, 10001L, 
10001L, 10001L, 10001L, 10002L, 10002L, 10002L, 10002L, 10002L, 
10002L, 10003L, 10003L, 10003L, 10003L, 10003L, 10003L, 10003L, 
10003L, 10003L, 10003L), year = c(2012L, 2012L, 2013L, 2013L, 
2013L, 2014L, 2014L, 2014L, 2012L, 2012L, 2012L, 2013L, 2013L, 
2013L, 2012L, 2012L, 2012L, 2014L, 2014L, 2015L, 2015L, 2015L, 
2016L, 2016L), type = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L), .Label = c("A", "B", "C"), class = "factor"), occurrence = c(1L, 
2L, 5L, 3L, 2L, 1L, 5L, 3L, 2L, 1L, 6L, 3L, 2L, 7L, 2L, 1L, 6L, 
3L, 2L, 2L, 5L, 2L, 3L, 6L)), row.names = c(NA, -24L), class = "data.frame")

Upvotes: 2

Related Questions