Reputation: 33

How to create new variables based on a binary variable in a grouped data set in R?

The data set has 3 columns-the 1st column is "id", the 2nd column is "year", and the 3rd column is "node". The 3rd column is a binary variable. Now, we need to fix the data errors on the column 3 "node" with the rule as follows.

1)Within each id, all values before the last value of node=1 should be equal to 1. There should be no node=0 occurring before a node=1. It should only stay at node=1 for all years or go from node=1 to node=0 at some point.

2)within each id, if all values of node from year1 to year8 equals (0 or 1), then we must keep them without any change.

To sum, the corrected data set should look like this,

 id     node   year
383100111   1   1
383100111   1   2
383100111   1   3
383100111   1   4
383100111   1   5
383100111   1   6
383100111   1   7
383100111   0   8
383100222   1   1
383100222   1   2
383100222   1   3
383100222   1   4
383100222   1   5
383100222   1   6
383100222   1   7
383100222   1   8
383100333   1   1
383100333   1   2
383100333   1   3
383100333   1   4
383100333   1   5
383100333   1   6
383100333   1   7
383100333   1   8
383100444   1   1
383100444   1   2
383100444   1   3
383100444   1   4
383100444   1   5
383100444   1   6
383100444   1   7
383100444   0   8
383100555   1   1
383100555   1   2
383100555   1   3
383100555   1   4
383100555   1   5
383100555   1   6
383100555   1   7
383100555   1   8
383100666   0   1
383100666   0   2
383100666   0   3
383100666   0   4
383100666   0   5
383100666   0   6
383100666   0   7
383100666   0   8
383100777   1   1
383100777   1   2
383100777   1   3
383100777   1   4
383100777   1   5
383100777   1   6
383100777   1   7
383100777   1   8

The original data set with errors is structured as follows,

structure(list(id = c(383100111, 383100111, 383100111, 383100111, 
383100111, 383100111, 383100111, 383100111, 383100222, 383100222, 
383100222, 383100222, 383100222, 383100222, 383100222, 383100222, 
383100333, 383100333, 383100333, 383100333, 383100333, 383100333, 
383100333, 383100333, 383100444, 383100444, 383100444, 383100444, 
383100444, 383100444, 383100444, 383100444, 383100555, 383100555, 
383100555, 383100555, 383100555, 383100555, 383100555, 383100555, 
383100666, 383100666, 383100666, 383100666, 383100666, 383100666, 
383100666, 383100666, 383100777, 383100777, 383100777, 383100777, 
383100777, 383100777, 383100777, 383100777), node = c(1, 1, 1, 
0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1), year = c(1, 2, 3, 4, 5, 6, 
7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 
4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 
1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, 56L), class ="data.frame")->dataframe

Thank you!

Upvotes: 0

Answers (2)

AnilGoyal

Reputation: 26238

Another strategy, which may be easier to understand

library(tidyverse)

df_repaired <- df %>% group_by(id) %>%
  mutate(dummy_id = row_number(),
         col1 = last(which(node ==1)),
         node2 = ifelse(dummy_id > col1 | is.na(col1), 0, 1)) %>%
  ungroup() %>%
  select(-node, -dummy_id, -col1) %>%
  rename(node = node2)

> print(df_repaired, n=56)
# A tibble: 56 x 3
          id  year  node
       <dbl> <dbl> <dbl>
 1 383100111     1     1
 2 383100111     2     1
 3 383100111     3     1
 4 383100111     4     1
 5 383100111     5     1
 6 383100111     6     1
 7 383100111     7     1
 8 383100111     8     0
 9 383100222     1     1
10 383100222     2     1
11 383100222     3     1
12 383100222     4     1
13 383100222     5     1
14 383100222     6     1
15 383100222     7     1
16 383100222     8     1
17 383100333     1     1
18 383100333     2     1
19 383100333     3     1
20 383100333     4     1
21 383100333     5     1
22 383100333     6     1
23 383100333     7     1
24 383100333     8     1
25 383100444     1     1
26 383100444     2     1
27 383100444     3     1
28 383100444     4     1
29 383100444     5     1
30 383100444     6     1
31 383100444     7     1
32 383100444     8     0
33 383100555     1     1
34 383100555     2     1
35 383100555     3     1
36 383100555     4     1
37 383100555     5     1
38 383100555     6     1
39 383100555     7     1
40 383100555     8     1
41 383100666     1     0
42 383100666     2     0
43 383100666     3     0
44 383100666     4     0
45 383100666     5     0
46 383100666     6     0
47 383100666     7     0
48 383100666     8     0
49 383100777     1     1
50 383100777     2     1
51 383100777     3     1
52 383100777     4     1
53 383100777     5     1
54 383100777     6     1
55 383100777     7     1
56 383100777     8     1

Upvotes: 0

AnilGoyal

Reputation: 26238

try this code

library(tidyverse)
df_repaired <- df %>% group_by(id) %>%
  mutate(col1 = cumsum(node)) %>%
  mutate(node2 = ifelse(last(col1)==col1 & node == 0, 0, 1)) %>%
  ungroup() %>%
  select(-col1, -node) %>%
  rename(node = node2)

> df_repaired
# A tibble: 56 x 3
          id  year  node
       <dbl> <dbl> <dbl>
 1 383100111     1     1
 2 383100111     2     1
 3 383100111     3     1
 4 383100111     4     1
 5 383100111     5     1
 6 383100111     6     1
 7 383100111     7     1
 8 383100111     8     0
 9 383100222     1     1
10 383100222     2     1
# ... with 46 more rows

Methodology explained:

Cumulative summation on field node. For this cumsum() has been used
checking when cumulative sum stopped increasing further. This was checked with ifelse conditional statement using last value in each group. In order to rule out the last 1 in node, this was also included as a condition with & operator.

To further explain the things let's take code without deleting dummy columns created..

df_repaired <- df %>% group_by(id) %>%
  mutate(col1 = cumsum(node)) %>%
  mutate(node2 = ifelse(last(col1)==col1 & node == 0, 0, 1)) %>%
  ungroup()

> print(df_repaired, n=56)
# A tibble: 56 x 5
          id  node  year  col1 node2
       <dbl> <dbl> <dbl> <dbl> <dbl>
 1 383100111     1     1     1     1
 2 383100111     1     2     2     1
 3 383100111     1     3     3     1
 4 383100111     0     4     3     1
 5 383100111     1     5     4     1
 6 383100111     0     6     4     1
 7 383100111     1     7     5     1
 8 383100111     0     8     5     0
 9 383100222     0     1     0     1
10 383100222     1     2     1     1
11 383100222     0     3     1     1
12 383100222     1     4     2     1
13 383100222     0     5     2     1
14 383100222     0     6     2     1
15 383100222     1     7     3     1
16 383100222     1     8     4     1
17 383100333     1     1     1     1
18 383100333     0     2     1     1
19 383100333     0     3     1     1
20 383100333     0     4     1     1
21 383100333     1     5     2     1
22 383100333     1     6     3     1
23 383100333     0     7     3     1
24 383100333     1     8     4     1
25 383100444     0     1     0     1
26 383100444     0     2     0     1
27 383100444     1     3     1     1
28 383100444     1     4     2     1
29 383100444     0     5     2     1
30 383100444     1     6     3     1
31 383100444     1     7     4     1
32 383100444     0     8     4     0
33 383100555     1     1     1     1
34 383100555     0     2     1     1
35 383100555     1     3     2     1
36 383100555     1     4     3     1
37 383100555     1     5     4     1
38 383100555     1     6     5     1
39 383100555     0     7     5     1
40 383100555     1     8     6     1
41 383100666     0     1     0     0
42 383100666     0     2     0     0
43 383100666     0     3     0     0
44 383100666     0     4     0     0
45 383100666     0     5     0     0
46 383100666     0     6     0     0
47 383100666     0     7     0     0
48 383100666     0     8     0     0
49 383100777     1     1     1     1
50 383100777     1     2     2     1
51 383100777     1     3     3     1
52 383100777     1     4     4     1
53 383100777     1     5     5     1
54 383100777     1     6     6     1
55 383100777     1     7     7     1
56 383100777     1     8     8     1

Upvotes: 1

How to create new variables based on a binary variable in a grouped data set in R?

Answers (2)

Related Questions