Reputation: 211
Hers is a DF for test:
test_df <- structure(list(plant_id = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
"plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
"plant_3", "plant_3", "plant_3", "plant_3", "plant_3",
"plant_4", "plant_4", "plant_4", "plant_4", "plant_4"),
skipped = c(1, 1, 0, 1, 2,
0, 1, 1, 0, 2,
1, 0, 1, 2, 2,
0, 0, 1, 1, 2)),
row.names = c(NA, -20L), class = "data.frame",
.Names = c("plant_sp", "skipped"))
As you can see, variable skipped has the values "0", "1" or "2". I need that per each plant_id (which is the group_by variable), that when its' first rows are "1" at skipped column, row will be deleted until a change at skipped column.
for example in my DF:
plant_sp skipped
1 plant_1 1
2 plant_1 1
3 plant_1 0
4 plant_1 1
5 plant_1 2
6 plant_2 0
7 plant_2 1
8 plant_2 1
9 plant_2 0
10 plant_2 2
11 plant_3 1
12 plant_3 0
13 plant_3 1
14 plant_3 2
15 plant_3 2
16 plant_4 0
17 plant_4 0
18 plant_4 1
19 plant_4 1
20 plant_4 2
TO:
plant_sp skipped
plant_sp skipped
3 plant_1 0
4 plant_1 1
5 plant_1 2
6 plant_2 0
7 plant_2 1
8 plant_2 1
9 plant_2 0
10 plant_2 2
12 plant_3 0
13 plant_3 1
14 plant_3 2
15 plant_3 2
16 plant_4 0
17 plant_4 0
18 plant_4 1
19 plant_4 1
20 plant_4 2
As you can see, since group "planet_1" and group "planet_2" started with "1", all rows with "1" at variable skipped at the begining were deleted (rows 1 and 2). All other rows stayed as they are.
if it is possible, a dplyr solution would be great, Thanks a lot!!!
Upvotes: 2
Views: 362
Reputation: 1525
Update: Original version didn't satisfy the query it has been updated now to remove only the rows including a one before a non-one entry. This matches the output presented in the query.
To try and complete this I have created some temporary rows to identify the first row which doesn't contain a one in each group and then removed all rows before this
library(tidyverse)
test_df <- structure(list(plant_id = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
"plant_2", "plant_2", "plant_2", "plant_2", "plant_2",
"plant_3", "plant_3", "plant_3", "plant_3", "plant_3",
"plant_4", "plant_4", "plant_4", "plant_4", "plant_4"),
skipped = c(1, 1, 0, 1, 2,
0, 1, 1, 0, 2,
1, 0, 1, 2, 2,
0, 0, 1, 1, 2)),
row.names = c(NA, -20L), class = "data.frame",
.Names = c("plant_sp", "skipped"))
test_df <- tibble(test_df)
first_positions_df<- test_df %>%
# group by each factor we want
group_by(plant_sp) %>%
# label the order of the rows
mutate(order = 1:length(skipped)) %>%
# mark position of rows which aren't a 1 otherwise set to infinity
mutate(notones = ifelse(skipped != 1, order, Inf)) %>%
# Find the first position which is not a 1
mutate(ignore = min(notones)) %>%
# Remove all ones before this row
filter(ignore <= order)
#Final result
first_positions_df %>%
# select only the useful columns
select(plant_sp, skipped)
#> # A tibble: 17 x 2
#> # Groups: plant_sp [4]
#> plant_sp skipped
#> <chr> <dbl>
#> 1 plant_1 0
#> 2 plant_1 1
#> 3 plant_1 2
#> 4 plant_2 0
#> 5 plant_2 1
#> 6 plant_2 1
#> 7 plant_2 0
#> 8 plant_2 2
#> 9 plant_3 0
#> 10 plant_3 1
#> 11 plant_3 2
#> 12 plant_3 2
#> 13 plant_4 0
#> 14 plant_4 0
#> 15 plant_4 1
#> 16 plant_4 1
#> 17 plant_4 2
Created on 2021-04-04 by the reprex package (v2.0.0)
Upvotes: 1
Reputation: 21918
It may not be the most efficient way of handling this problem but here is another way:
test_df %>%
group_by(plant_sp) %>%
mutate(id = row_number()) %>%
filter(!(skipped == 1 & id == 1)) %>%
mutate(id = row_number()) %>%
filter(!(skipped == 1 & id == 1)) %>%
select(-id)
# A tibble: 17 x 2
# Groups: plant_sp [4]
plant_sp skipped
<chr> <dbl>
1 plant_1 0
2 plant_1 1
3 plant_1 2
4 plant_2 0
5 plant_2 1
6 plant_2 1
7 plant_2 0
8 plant_2 2
9 plant_3 0
10 plant_3 1
11 plant_3 2
12 plant_3 2
13 plant_4 0
14 plant_4 0
15 plant_4 1
16 plant_4 1
17 plant_4 2
Upvotes: 1
Reputation: 78927
We could use the window function cummin
from dplyr
library(dplyr)
test_df %>%
group_by(plant_sp) %>%
mutate(cummin = cummin(skipped)) %>%
filter(cummin == 0) %>%
select(-cummin)
Output:
plant_sp skipped
<chr> <dbl>
1 plant_1 0
2 plant_1 1
3 plant_1 2
4 plant_2 0
5 plant_2 1
6 plant_2 1
7 plant_2 0
8 plant_2 2
9 plant_3 0
10 plant_3 1
11 plant_3 2
12 plant_3 2
13 plant_4 0
14 plant_4 0
15 plant_4 1
16 plant_4 1
17 plant_4 2
Upvotes: 1
Reputation: 432
As I understand it, you want to look at the row before, and see if there are any changes, and continue throughout the dataframe. I believe this can be done by looking at a lag and then filtering down to rows where the lagged value and current value differ. Assuming your dataframe is sorted as in your test, this should work:
test_df %>%
mutate(lag1 = coalesce(lag(plant_sp, 1), plant_sp),
lag2 = coalesce(lag(skipped, 1), skipped)) %>%
filter(plant_sp == lag1 & lag2 != skipped) %>%
select(plant_sp, skipped)
Upvotes: 0