Reputation: 159
I have panel data that looks like below.
pid year type occurrence
10001 2012 A 1
10001 2012 A 2
10001 2013 B 5
10001 2013 B 3
10001 2013 B 2
10001 2014 A 1
10001 2014 A 5
10001 2014 A 3
10002 2012 B 2
10002 2012 B 1
10002 2012 B 6
10002 2013 B 3
10002 2013 B 2
10002 2013 B 7
10003 2012 B 2
10003 2012 B 1
10003 2012 B 6
10003 2014 B 3
10003 2014 B 2
10003 2015 B 2
10003 2015 B 5
10003 2015 B 2
10003 2016 B 3
10003 2016 B 6
I would like to create a variable that indicates whether for each pid, the type has been changed throughout the years such that
pid year type occurrence typechange
10001 2012 A 1 NA
10001 2012 A 2 NA
10001 2013 B 5 "A->B"
10001 2013 B 3 "A->B"
10001 2013 B 2 "A->B"
10001 2014 A 1 "B->A"
10001 2014 A 5 "B->A"
10001 2014 A 3 "B->A"
I tried loop through all pids and years to keep track of changes, but quickly realized that the code runs too slow for my little laptop. (dataset about 200,000 entries)
I decided to create lag variable for type
a type.lag1
, then compare two to create the typechange
variable, but confronted a wall.
The expected lag variable should look like this:
pid year type occurrence type.lag1
10001 2012 A 1 NA
10001 2012 A 2 NA
10001 2013 B 5 A
10001 2013 B 3 A
10001 2013 B 2 A
10001 2014 A 1 B
10001 2014 A 5 B
10001 2014 A 3 B
10002 2012 B 2 NA
10002 2012 B 1 NA
10002 2012 B 6 NA
10002 2013 B 3 B
10002 2013 B 2 B
10002 2013 B 7 B
10003 2012 B 2 NA
10003 2012 B 1 NA
10003 2012 B 6 NA
10003 2014 A 3 B
10003 2014 A 2 B
10003 2015 B 2 A
10003 2015 B 5 A
10003 2015 B 2 A
10003 2016 C 3 B
10003 2016 C 6 B
The lagged variable should be such that for each pid, if type is A in 2012 , then in the following year 2013 (if 2013 is missing, then in 2014 or forth), lag should have type A for all pid.
Many of the time, 3 pid appear in each year, but it is not guaranteed. There are cases where some of years are missing too. Then the lag should get values from previous(may not be consecutive) year.
If there are some method other than creating a lag variable to create typechange
, it would be very helpful to hear.
Thank you very much in advance!
Upvotes: 0
Views: 138
Reputation: 388907
Here is one way using lag
from dplyr
and tidyr::fill
library(dplyr)
df %>%
arrange(pid, year) %>%
group_by(pid) %>%
mutate(type.lag1 = replace(lag(type), year == lag(year, default = first(year)),
NA)) %>%
tidyr::fill(type.lag1)
# pid year type occurrence type.lag1
#1 10001 2012 A 1 <NA>
#2 10001 2012 A 2 <NA>
#3 10001 2013 B 5 A
#4 10001 2013 B 3 A
#5 10001 2013 B 2 A
#6 10001 2014 A 1 B
#7 10001 2014 A 5 B
#8 10001 2014 A 3 B
#9 10002 2012 B 2 <NA>
#10 10002 2012 B 1 <NA>
#11 10002 2012 B 6 <NA>
#12 10002 2013 B 3 B
#13 10002 2013 B 2 B
#14 10002 2013 B 7 B
#15 10003 2012 B 2 <NA>
#16 10003 2012 B 1 <NA>
#17 10003 2012 B 6 <NA>
#18 10003 2014 A 3 B
#19 10003 2014 A 2 B
#20 10003 2015 B 2 A
#21 10003 2015 B 5 A
#22 10003 2015 B 2 A
#23 10003 2016 C 3 B
#24 10003 2016 C 6 B
data
df <- structure(list(pid = c(10001L, 10001L, 10001L, 10001L, 10001L,
10001L, 10001L, 10001L, 10002L, 10002L, 10002L, 10002L, 10002L,
10002L, 10003L, 10003L, 10003L, 10003L, 10003L, 10003L, 10003L,
10003L, 10003L, 10003L), year = c(2012L, 2012L, 2013L, 2013L,
2013L, 2014L, 2014L, 2014L, 2012L, 2012L, 2012L, 2013L, 2013L,
2013L, 2012L, 2012L, 2012L, 2014L, 2014L, 2015L, 2015L, 2015L,
2016L, 2016L), type = structure(c(1L, 1L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 3L,
3L), .Label = c("A", "B", "C"), class = "factor"), occurrence = c(1L,
2L, 5L, 3L, 2L, 1L, 5L, 3L, 2L, 1L, 6L, 3L, 2L, 7L, 2L, 1L, 6L,
3L, 2L, 2L, 5L, 2L, 3L, 6L)), row.names = c(NA, -24L), class = "data.frame")
Upvotes: 2