Sharif Amlani
Sharif Amlani

Reputation: 1278

Creating sequence indicators corresponding to data timing in R

I am working with data that looks like this:

   ID Year Variable_of_Interest
1   a 2000                    0
2   a 2001                    0
3   a 2002                    0
4   a 2003                    0
5   a 2004                    0
6   a 2005                    1
7   a 2006                    1
8   a 2007                    1
9   a 2008                    1
10  a 2009                    1
11  b 2000                    0
12  b 2001                    0
13  b 2002                    0
14  b 2003                    1
15  b 2004                    1
16  b 2005                    1
17  b 2006                    0
18  b 2007                    1
19  b 2008                    1
20  b 2009                    1
21  c 2000                    0
22  c 2001                    0
23  c 2002                    0
24  c 2003                    0
25  c 2004                    0
26  c 2005                    0
27  c 2006                    1
28  c 2007                    1
29  c 2008                    1
30  c 2009                    0
31  d 2000                    0
32  d 2001                    0
33  d 2002                    1
34  d 2003                    1
35  d 2004                    0
36  d 2005                    1
37  d 2006                    1
38  d 2007                    1
39  d 2008                    1
40  d 2009                    1

The unit of analysis is in the ID column. The IDs repeat across each year in the data.

The variable of interest column represents changes to the IDs, wherein some years the values are a 0 and other years they are a 1.

I want to create an additional column that includes sequences of numbers that document the time before and after a code changes (defined as going from 0 to 1) in the Variable_of_Interest at the year before and after the change.

The code must account for repeat code changes (defined as going from 0 to 1), such as ID b from 2002-2003 and 2006-2007.

NAs can be assigned to 0 values without changes back to 1, such as 0 in "c" 2009.

Such that, the data looks like:

   ID Year Variable_of_Interest Solution
1   a 2000                    0       -5
2   a 2001                    0       -4
3   a 2002                    0       -3
4   a 2003                    0       -2
5   a 2004                    0       -1
6   a 2005                    1        0
7   a 2006                    1        1
8   a 2007                    1        2
9   a 2008                    1        3
10  a 2009                    1        4
11  b 2000                    0       -3
12  b 2001                    0       -2
13  b 2002                    0       -1
14  b 2003                    1        0
15  b 2004                    1        1
16  b 2005                    1        2
17  b 2006                    0       -1
18  b 2007                    1        0
19  b 2008                    1        1
20  b 2009                    1        2
21  c 2000                    0       -6
22  c 2001                    0       -5
23  c 2002                    0       -4
24  c 2003                    0       -3
25  c 2004                    0       -2
26  c 2005                    0       -1
27  c 2006                    1        0
28  c 2007                    1        1
29  c 2008                    1        2
30  c 2009                    0       NA
31  d 2000                    0       -2
32  d 2001                    0       -1
33  d 2002                    1        0
34  d 2003                    1        1
35  d 2004                    0       -1
36  d 2005                    1        0
37  d 2006                    1        1
38  d 2007                    1        2
39  d 2008                    1        3
40  d 2009                    1        4

Here is the replication code:

ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5), 
                         rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
                         rep(0,6), rep(1, 3), rep(0, 1),
                         rep(0,2), rep(1, 2), rep(0,1), rep(1,5))


data.frame(ID, Year, Variable_of_Interest)

Thank you so much for your help!

Upvotes: 1

Views: 105

Answers (2)

chinsoon12
chinsoon12

Reputation: 25225

Another option using data.table:

#identify runs
setDT(DF)[, ri := rleid(voi)]

#generate the desired output depending on whether VOI is 1 or 0
DF[, soln := if (voi[1L]==1L) seq.int(.N) - 1L else -rev(seq.int(.N)), .(ID, ri)]

#replace trailing 0 with NA
DF[, soln := if(voi[.N]==0L) replace(soln, ri==ri[.N], NA_integer_) else soln, ID]

data:

ID <- c(rep("a",10), rep("b", 10), rep("c", 10), rep("d", 10)); length(ID)
Year <- rep(seq(2000,2009, 1), 4)
Variable_of_Interest <- c(rep(0,5), rep(1, 5), 
    rep(0,3), rep(1, 3), rep(0, 1), rep(1, 3),
    rep(0,6), rep(1, 3), rep(0, 1),
    rep(0,2), rep(1, 2), rep(0,1), rep(1,5))
DF <- data.frame(ID, Year, voi=Variable_of_Interest)

Upvotes: 3

arg0naut91
arg0naut91

Reputation: 14764

Here is a dplyr option:

library(dplyr)

df %>%
  group_by(ID, idx = cumsum(Variable_of_Interest != lag(Variable_of_Interest, default = first(Variable_of_Interest)))) %>%
  mutate(Solution = case_when(Variable_of_Interest == 0 ~ rev(-1:(-n())), Variable_of_Interest == 1 ~ 0:(n() - 1))) %>%
  group_by(ID) %>%
  mutate(Solution = replace(Solution, idx == max(idx) & Variable_of_Interest == 0, NA)) %>%
  select(-idx)

Output:

   ID Year Variable_of_Interest idx Solution
1   a 2000                    0   0       -5
2   a 2001                    0   0       -4
3   a 2002                    0   0       -3
4   a 2003                    0   0       -2
5   a 2004                    0   0       -1
6   a 2005                    1   1        0
7   a 2006                    1   1        1
8   a 2007                    1   1        2
9   a 2008                    1   1        3
10  a 2009                    1   1        4
11  b 2000                    0   2       -3
12  b 2001                    0   2       -2
13  b 2002                    0   2       -1
14  b 2003                    1   3        0
15  b 2004                    1   3        1
16  b 2005                    1   3        2
17  b 2006                    0   4       -1
18  b 2007                    1   5        0
19  b 2008                    1   5        1
20  b 2009                    1   5        2
21  c 2000                    0   6       -6
22  c 2001                    0   6       -5
23  c 2002                    0   6       -4
24  c 2003                    0   6       -3
25  c 2004                    0   6       -2
26  c 2005                    0   6       -1
27  c 2006                    1   7        0
28  c 2007                    1   7        1
29  c 2008                    1   7        2
30  c 2009                    0   8       NA
31  d 2000                    0   8       -2
32  d 2001                    0   8       -1
33  d 2002                    1   9        0
34  d 2003                    1   9        1
35  d 2004                    0  10       -1
36  d 2005                    1  11        0
37  d 2006                    1  11        1
38  d 2007                    1  11        2
39  d 2008                    1  11        3
40  d 2009                    1  11        4

Upvotes: 2

Related Questions