Creating sequence indicators corresponding to data timing in R

Question

I am working with data that looks like this:

   ID Year Variable_of_Interest
1   a 2000                    0
2   a 2001                    0
3   a 2002                    0
4   a 2003                    0
5   a 2004                    0
6   a 2005                    1
7   a 2006                    1
8   a 2007                    1
9   a 2008                    1
10  a 2009                    1
11  b 2000                    0
12  b 2001                    0
13  b 2002                    0
14  b 2003                    1
15  b 2004                    1
16  b 2005                    1
17  b 2006                    0
18  b 2007                    1
19  b 2008                    1
20  b 2009                    1
21  c 2000                    0
22  c 2001                    0
23  c 2002                    0
24  c 2003                    0
25  c 2004                    0
26  c 2005                    0
27  c 2006                    1
28  c 2007                    1
29  c 2008                    1
30  c 2009                    0
31  d 2000                    0
32  d 2001                    0
33  d 2002                    1
34  d 2003                    1
35  d 2004                    0
36  d 2005                    1
37  d 2006                    1
38  d 2007                    1
39  d 2008                    1
40  d 2009                    1

The unit of analysis is in the ID column. The IDs repeat across each year in the data.

The variable of interest column represents changes to the IDs, wherein some years the values are a 0 and other years they are a 1.

I want to create an additional column that includes sequences of numbers that document the time before and after a code changes (defined as going from 0 to 1) in the Variable_of_Interest at the year before and after the change.

The code must account for repeat code changes (defined as going from 0 to 1), such as ID b from 2002-2003 and 2006-2007.

NAs can be assigned to 0 values without changes back to 1, such as 0 in "c" 2009.

Such that, the data looks like:

   ID Year Variable_of_Interest Solution
1   a 2000                    0       -5
2   a 2001                    0       -4
3   a 2002                    0       -3
4   a 2003                    0       -2
5   a 2004                    0       -1
6   a 2005                    1        0
7   a 2006                    1        1
8   a 2007                    1        2
9   a 2008                    1        3
10  a 2009                    1        4
11  b 2000                    0       -3
12  b 2001                    0       -2
13  b 2002                    0       -1
14  b 2003                    1        0
15  b 2004                    1        1
16  b 2005                    1        2
17  b 2006                    0       -1
18  b 2007                    1        0
19  b 2008                    1        1
20  b 2009                    1        2
21  c 2000                    0       -6
22  c 2001                    0       -5
23  c 2002                    0       -4
24  c 2003                    0       -3
25  c 2004                    0       -2
26  c 2005                    0       -1
27  c 2006                    1        0
28  c 2007                    1        1
29  c 2008                    1        2
30  c 2009                    0       NA
31  d 2000                    0       -2
32  d 2001                    0       -1
33  d 2002                    1        0
34  d 2003                    1        1
35  d 2004                    0       -1
36  d 2005                    1        0
37  d 2006                    1        1
38  d 2007                    1        2
39  d 2008                    1        3
40  d 2009                    1        4

Here is the replication code:

ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5), 
                         rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
                         rep(0,6), rep(1, 3), rep(0, 1),
                         rep(0,2), rep(1, 2), rep(0,1), rep(1,5))


data.frame(ID, Year, Variable_of_Interest)

Thank you so much for your help!

arg0naut91 · Accepted Answer

Here is a dplyr option:

library(dplyr)

df %>%
  group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
  mutate(Solution = case_when(Variable_of_Interest == 0 ~ rev(-1:(-n())), Variable_of_Interest == 1 ~ 0:(n() - 1))) %>%
  group_by(ID) %>%
  mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
  select(-idx)

Output:

   ID Year Variable_of_Interest idx Solution
1   a 2000                    0   0       -5
2   a 2001                    0   0       -4
3   a 2002                    0   0       -3
4   a 2003                    0   0       -2
5   a 2004                    0   0       -1
6   a 2005                    1   1        0
7   a 2006                    1   1        1
8   a 2007                    1   1        2
9   a 2008                    1   1        3
10  a 2009                    1   1        4
11  b 2000                    0   2       -3
12  b 2001                    0   2       -2
13  b 2002                    0   2       -1
14  b 2003                    1   3        0
15  b 2004                    1   3        1
16  b 2005                    1   3        2
17  b 2006                    0   4       -1
18  b 2007                    1   5        0
19  b 2008                    1   5        1
20  b 2009                    1   5        2
21  c 2000                    0   6       -6
22  c 2001                    0   6       -5
23  c 2002                    0   6       -4
24  c 2003                    0   6       -3
25  c 2004                    0   6       -2
26  c 2005                    0   6       -1
27  c 2006                    1   7        0
28  c 2007                    1   7        1
29  c 2008                    1   7        2
30  c 2009                    0   8       NA
31  d 2000                    0   8       -2
32  d 2001                    0   8       -1
33  d 2002                    1   9        0
34  d 2003                    1   9        1
35  d 2004                    0  10       -1
36  d 2005                    1  11        0
37  d 2006                    1  11        1
38  d 2007                    1  11        2
39  d 2008                    1  11        3
40  d 2009                    1  11        4

Creating sequence indicators corresponding to data timing in R

Answers (2)

Related Questions