Reputation: 1278
I am working with data that looks like this:
ID Year Variable_of_Interest
1 a 2000 0
2 a 2001 0
3 a 2002 0
4 a 2003 0
5 a 2004 0
6 a 2005 1
7 a 2006 1
8 a 2007 1
9 a 2008 1
10 a 2009 1
11 b 2000 0
12 b 2001 0
13 b 2002 0
14 b 2003 1
15 b 2004 1
16 b 2005 1
17 b 2006 0
18 b 2007 1
19 b 2008 1
20 b 2009 1
21 c 2000 0
22 c 2001 0
23 c 2002 0
24 c 2003 0
25 c 2004 0
26 c 2005 0
27 c 2006 1
28 c 2007 1
29 c 2008 1
30 c 2009 0
31 d 2000 0
32 d 2001 0
33 d 2002 1
34 d 2003 1
35 d 2004 0
36 d 2005 1
37 d 2006 1
38 d 2007 1
39 d 2008 1
40 d 2009 1
The unit of analysis is in the ID
column. The IDs repeat across each year in the data.
The variable of interest
column represents changes to the IDs, wherein some years the values are a 0 and other years they are a 1.
I want to create an additional column that includes sequences of numbers that document the time before and after a code changes (defined as going from 0 to 1) in the Variable_of_Interest
at the year before and after the change.
The code must account for repeat code changes (defined as going from 0 to 1), such as ID b from 2002-2003 and 2006-2007.
NAs can be assigned to 0 values without changes back to 1, such as 0 in "c" 2009.
Such that, the data looks like:
ID Year Variable_of_Interest Solution
1 a 2000 0 -5
2 a 2001 0 -4
3 a 2002 0 -3
4 a 2003 0 -2
5 a 2004 0 -1
6 a 2005 1 0
7 a 2006 1 1
8 a 2007 1 2
9 a 2008 1 3
10 a 2009 1 4
11 b 2000 0 -3
12 b 2001 0 -2
13 b 2002 0 -1
14 b 2003 1 0
15 b 2004 1 1
16 b 2005 1 2
17 b 2006 0 -1
18 b 2007 1 0
19 b 2008 1 1
20 b 2009 1 2
21 c 2000 0 -6
22 c 2001 0 -5
23 c 2002 0 -4
24 c 2003 0 -3
25 c 2004 0 -2
26 c 2005 0 -1
27 c 2006 1 0
28 c 2007 1 1
29 c 2008 1 2
30 c 2009 0 NA
31 d 2000 0 -2
32 d 2001 0 -1
33 d 2002 1 0
34 d 2003 1 1
35 d 2004 0 -1
36 d 2005 1 0
37 d 2006 1 1
38 d 2007 1 2
39 d 2008 1 3
40 d 2009 1 4
Here is the replication code:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
data.frame(ID, Year, Variable_of_Interest)
Thank you so much for your help!
Upvotes: 1
Views: 105
Reputation: 25225
Another option using data.table
:
#identify runs
setDT(DF)[, ri := rleid(voi)]
#generate the desired output depending on whether VOI is 1 or 0
DF[, soln := if (voi[1L]==1L) seq.int(.N) - 1L else -rev(seq.int(.N)), .(ID, ri)]
#replace trailing 0 with NA
DF[, soln := if(voi[.N]==0L) replace(soln, ri==ri[.N], NA_integer_) else soln, ID]
data:
ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5),
rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
rep(0,6), rep(1, 3), rep(0, 1),
rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
DF <- data.frame(ID, Year, voi=Variable_of_Interest)
Upvotes: 3
Reputation: 14764
Here is a dplyr
option:
library(dplyr)
df %>%
group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
mutate(Solution = case_when(Variable_of_Interest == 0 ~ rev(-1:(-n())), Variable_of_Interest == 1 ~ 0:(n() - 1))) %>%
group_by(ID) %>%
mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
select(-idx)
Output:
ID Year Variable_of_Interest idx Solution
1 a 2000 0 0 -5
2 a 2001 0 0 -4
3 a 2002 0 0 -3
4 a 2003 0 0 -2
5 a 2004 0 0 -1
6 a 2005 1 1 0
7 a 2006 1 1 1
8 a 2007 1 1 2
9 a 2008 1 1 3
10 a 2009 1 1 4
11 b 2000 0 2 -3
12 b 2001 0 2 -2
13 b 2002 0 2 -1
14 b 2003 1 3 0
15 b 2004 1 3 1
16 b 2005 1 3 2
17 b 2006 0 4 -1
18 b 2007 1 5 0
19 b 2008 1 5 1
20 b 2009 1 5 2
21 c 2000 0 6 -6
22 c 2001 0 6 -5
23 c 2002 0 6 -4
24 c 2003 0 6 -3
25 c 2004 0 6 -2
26 c 2005 0 6 -1
27 c 2006 1 7 0
28 c 2007 1 7 1
29 c 2008 1 7 2
30 c 2009 0 8 NA
31 d 2000 0 8 -2
32 d 2001 0 8 -1
33 d 2002 1 9 0
34 d 2003 1 9 1
35 d 2004 0 10 -1
36 d 2005 1 11 0
37 d 2006 1 11 1
38 d 2007 1 11 2
39 d 2008 1 11 3
40 d 2009 1 11 4
Upvotes: 2