Reputation: 33
The data set has 3 columns-the 1st column is "id", the 2nd column is "year", and the 3rd column is "node". The 3rd column is a binary variable. Now, we need to fix the data errors on the column 3 "node" with the rule as follows.
1)Within each id, all values before the last value of node=1 should be equal to 1. There should be no node=0 occurring before a node=1. It should only stay at node=1 for all years or go from node=1 to node=0 at some point.
2)within each id, if all values of node from year1 to year8 equals (0 or 1), then we must keep them without any change.
To sum, the corrected data set should look like this,
id node year
383100111 1 1
383100111 1 2
383100111 1 3
383100111 1 4
383100111 1 5
383100111 1 6
383100111 1 7
383100111 0 8
383100222 1 1
383100222 1 2
383100222 1 3
383100222 1 4
383100222 1 5
383100222 1 6
383100222 1 7
383100222 1 8
383100333 1 1
383100333 1 2
383100333 1 3
383100333 1 4
383100333 1 5
383100333 1 6
383100333 1 7
383100333 1 8
383100444 1 1
383100444 1 2
383100444 1 3
383100444 1 4
383100444 1 5
383100444 1 6
383100444 1 7
383100444 0 8
383100555 1 1
383100555 1 2
383100555 1 3
383100555 1 4
383100555 1 5
383100555 1 6
383100555 1 7
383100555 1 8
383100666 0 1
383100666 0 2
383100666 0 3
383100666 0 4
383100666 0 5
383100666 0 6
383100666 0 7
383100666 0 8
383100777 1 1
383100777 1 2
383100777 1 3
383100777 1 4
383100777 1 5
383100777 1 6
383100777 1 7
383100777 1 8
The original data set with errors is structured as follows,
structure(list(id = c(383100111, 383100111, 383100111, 383100111,
383100111, 383100111, 383100111, 383100111, 383100222, 383100222,
383100222, 383100222, 383100222, 383100222, 383100222, 383100222,
383100333, 383100333, 383100333, 383100333, 383100333, 383100333,
383100333, 383100333, 383100444, 383100444, 383100444, 383100444,
383100444, 383100444, 383100444, 383100444, 383100555, 383100555,
383100555, 383100555, 383100555, 383100555, 383100555, 383100555,
383100666, 383100666, 383100666, 383100666, 383100666, 383100666,
383100666, 383100666, 383100777, 383100777, 383100777, 383100777,
383100777, 383100777, 383100777, 383100777), node = c(1, 1, 1,
0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1), year = c(1, 2, 3, 4, 5, 6,
7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3,
4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8,
1, 2, 3, 4, 5, 6, 7, 8)), row.names = c(NA, 56L), class ="data.frame")->dataframe
Thank you!
Upvotes: 0
Views: 187
Reputation: 26218
Another strategy, which may be easier to understand
library(tidyverse)
df_repaired <- df %>% group_by(id) %>%
mutate(dummy_id = row_number(),
col1 = last(which(node ==1)),
node2 = ifelse(dummy_id > col1 | is.na(col1), 0, 1)) %>%
ungroup() %>%
select(-node, -dummy_id, -col1) %>%
rename(node = node2)
> print(df_repaired, n=56)
# A tibble: 56 x 3
id year node
<dbl> <dbl> <dbl>
1 383100111 1 1
2 383100111 2 1
3 383100111 3 1
4 383100111 4 1
5 383100111 5 1
6 383100111 6 1
7 383100111 7 1
8 383100111 8 0
9 383100222 1 1
10 383100222 2 1
11 383100222 3 1
12 383100222 4 1
13 383100222 5 1
14 383100222 6 1
15 383100222 7 1
16 383100222 8 1
17 383100333 1 1
18 383100333 2 1
19 383100333 3 1
20 383100333 4 1
21 383100333 5 1
22 383100333 6 1
23 383100333 7 1
24 383100333 8 1
25 383100444 1 1
26 383100444 2 1
27 383100444 3 1
28 383100444 4 1
29 383100444 5 1
30 383100444 6 1
31 383100444 7 1
32 383100444 8 0
33 383100555 1 1
34 383100555 2 1
35 383100555 3 1
36 383100555 4 1
37 383100555 5 1
38 383100555 6 1
39 383100555 7 1
40 383100555 8 1
41 383100666 1 0
42 383100666 2 0
43 383100666 3 0
44 383100666 4 0
45 383100666 5 0
46 383100666 6 0
47 383100666 7 0
48 383100666 8 0
49 383100777 1 1
50 383100777 2 1
51 383100777 3 1
52 383100777 4 1
53 383100777 5 1
54 383100777 6 1
55 383100777 7 1
56 383100777 8 1
Upvotes: 0
Reputation: 26218
try this code
library(tidyverse)
df_repaired <- df %>% group_by(id) %>%
mutate(col1 = cumsum(node)) %>%
mutate(node2 = ifelse(last(col1)==col1 & node == 0, 0, 1)) %>%
ungroup() %>%
select(-col1, -node) %>%
rename(node = node2)
> df_repaired
# A tibble: 56 x 3
id year node
<dbl> <dbl> <dbl>
1 383100111 1 1
2 383100111 2 1
3 383100111 3 1
4 383100111 4 1
5 383100111 5 1
6 383100111 6 1
7 383100111 7 1
8 383100111 8 0
9 383100222 1 1
10 383100222 2 1
# ... with 46 more rows
Methodology explained:
node
. For this cumsum()
has been usedifelse
conditional statement using last
value in each group. In order to rule out the last 1 in node, this was also included as a condition with &
operator.To further explain the things let's take code without deleting dummy columns created..
df_repaired <- df %>% group_by(id) %>%
mutate(col1 = cumsum(node)) %>%
mutate(node2 = ifelse(last(col1)==col1 & node == 0, 0, 1)) %>%
ungroup()
> print(df_repaired, n=56)
# A tibble: 56 x 5
id node year col1 node2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 383100111 1 1 1 1
2 383100111 1 2 2 1
3 383100111 1 3 3 1
4 383100111 0 4 3 1
5 383100111 1 5 4 1
6 383100111 0 6 4 1
7 383100111 1 7 5 1
8 383100111 0 8 5 0
9 383100222 0 1 0 1
10 383100222 1 2 1 1
11 383100222 0 3 1 1
12 383100222 1 4 2 1
13 383100222 0 5 2 1
14 383100222 0 6 2 1
15 383100222 1 7 3 1
16 383100222 1 8 4 1
17 383100333 1 1 1 1
18 383100333 0 2 1 1
19 383100333 0 3 1 1
20 383100333 0 4 1 1
21 383100333 1 5 2 1
22 383100333 1 6 3 1
23 383100333 0 7 3 1
24 383100333 1 8 4 1
25 383100444 0 1 0 1
26 383100444 0 2 0 1
27 383100444 1 3 1 1
28 383100444 1 4 2 1
29 383100444 0 5 2 1
30 383100444 1 6 3 1
31 383100444 1 7 4 1
32 383100444 0 8 4 0
33 383100555 1 1 1 1
34 383100555 0 2 1 1
35 383100555 1 3 2 1
36 383100555 1 4 3 1
37 383100555 1 5 4 1
38 383100555 1 6 5 1
39 383100555 0 7 5 1
40 383100555 1 8 6 1
41 383100666 0 1 0 0
42 383100666 0 2 0 0
43 383100666 0 3 0 0
44 383100666 0 4 0 0
45 383100666 0 5 0 0
46 383100666 0 6 0 0
47 383100666 0 7 0 0
48 383100666 0 8 0 0
49 383100777 1 1 1 1
50 383100777 1 2 2 1
51 383100777 1 3 3 1
52 383100777 1 4 4 1
53 383100777 1 5 5 1
54 383100777 1 6 6 1
55 383100777 1 7 7 1
56 383100777 1 8 8 1
Upvotes: 1