Ido
Ido

Reputation: 211

if_else for deleting first row in each group - dplyr

Hers is a DF for test:

test_df <- structure(list(plant_id = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2", 
                                       "plant_3", "plant_3", "plant_3", "plant_3", "plant_3",
                                       "plant_4", "plant_4", "plant_4", "plant_4", "plant_4"), 
                          skipped = c(1, 1, 0, 1, 2, 
                                      0, 1, 1, 0, 2,
                                      1, 0, 1, 2, 2, 
                                      0, 0, 1, 1, 2)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "skipped"))

As you can see, variable skipped has the values "0", "1" or "2". I need that per each plant_id (which is the group_by variable), that when its' first rows are "1" at skipped column, row will be deleted until a change at skipped column.

for example in my DF:

   plant_sp skipped
1   plant_1       1
2   plant_1       1
3   plant_1       0
4   plant_1       1
5   plant_1       2
6   plant_2       0
7   plant_2       1
8   plant_2       1
9   plant_2       0
10  plant_2       2
11  plant_3       1
12  plant_3       0
13  plant_3       1
14  plant_3       2
15  plant_3       2
16  plant_4       0
17  plant_4       0
18  plant_4       1
19  plant_4       1
20  plant_4       2

TO:

   plant_sp skipped
   plant_sp skipped

3   plant_1       0
4   plant_1       1
5   plant_1       2
6   plant_2       0
7   plant_2       1
8   plant_2       1
9   plant_2       0
10  plant_2       2
12  plant_3       0
13  plant_3       1
14  plant_3       2
15  plant_3       2
16  plant_4       0
17  plant_4       0
18  plant_4       1
19  plant_4       1
20  plant_4       2

As you can see, since group "planet_1" and group "planet_2" started with "1", all rows with "1" at variable skipped at the begining were deleted (rows 1 and 2). All other rows stayed as they are.

if it is possible, a dplyr solution would be great, Thanks a lot!!!

Upvotes: 2

Views: 362

Answers (4)

Joel Kandiah
Joel Kandiah

Reputation: 1525

Update: Original version didn't satisfy the query it has been updated now to remove only the rows including a one before a non-one entry. This matches the output presented in the query.

To try and complete this I have created some temporary rows to identify the first row which doesn't contain a one in each group and then removed all rows before this

library(tidyverse)

test_df <- structure(list(plant_id = c("plant_1", "plant_1", "plant_1", "plant_1", "plant_1",
                                       "plant_2", "plant_2", "plant_2", "plant_2", "plant_2", 
                                       "plant_3", "plant_3", "plant_3", "plant_3", "plant_3",
                                       "plant_4", "plant_4", "plant_4", "plant_4", "plant_4"), 
                          skipped = c(1, 1, 0, 1, 2, 
                                      0, 1, 1, 0, 2,
                                      1, 0, 1, 2, 2, 
                                      0, 0, 1, 1, 2)), 
                     row.names = c(NA, -20L), class = "data.frame", 
                     .Names = c("plant_sp", "skipped"))

test_df <- tibble(test_df)

first_positions_df<- test_df %>% 
  # group by each factor we want
  group_by(plant_sp) %>%
  # label the order of the rows
  mutate(order = 1:length(skipped)) %>% 
  # mark position of rows which aren't a 1 otherwise set to infinity
  mutate(notones = ifelse(skipped != 1, order, Inf)) %>% 
  # Find the first position which is not a 1
  mutate(ignore = min(notones)) %>% 
  # Remove all ones before this row
  filter(ignore <= order)

#Final result
first_positions_df %>% 
  # select only the useful columns
  select(plant_sp, skipped)
#> # A tibble: 17 x 2
#> # Groups:   plant_sp [4]
#>    plant_sp skipped
#>    <chr>      <dbl>
#>  1 plant_1        0
#>  2 plant_1        1
#>  3 plant_1        2
#>  4 plant_2        0
#>  5 plant_2        1
#>  6 plant_2        1
#>  7 plant_2        0
#>  8 plant_2        2
#>  9 plant_3        0
#> 10 plant_3        1
#> 11 plant_3        2
#> 12 plant_3        2
#> 13 plant_4        0
#> 14 plant_4        0
#> 15 plant_4        1
#> 16 plant_4        1
#> 17 plant_4        2

Created on 2021-04-04 by the reprex package (v2.0.0)

Upvotes: 1

Anoushiravan R
Anoushiravan R

Reputation: 21918

It may not be the most efficient way of handling this problem but here is another way:

test_df %>%
  group_by(plant_sp) %>%
  mutate(id = row_number()) %>%
  filter(!(skipped == 1 & id == 1)) %>%
  mutate(id = row_number()) %>%
  filter(!(skipped == 1 & id == 1)) %>%
  select(-id)

# A tibble: 17 x 2
# Groups:   plant_sp [4]
   plant_sp skipped
   <chr>      <dbl>
 1 plant_1        0
 2 plant_1        1
 3 plant_1        2
 4 plant_2        0
 5 plant_2        1
 6 plant_2        1
 7 plant_2        0
 8 plant_2        2
 9 plant_3        0
10 plant_3        1
11 plant_3        2
12 plant_3        2
13 plant_4        0
14 plant_4        0
15 plant_4        1
16 plant_4        1
17 plant_4        2

Upvotes: 1

TarJae
TarJae

Reputation: 78927

We could use the window function cummin from dplyr

library(dplyr)
test_df %>%  
  group_by(plant_sp) %>% 
  mutate(cummin = cummin(skipped)) %>% 
  filter(cummin == 0) %>% 
  select(-cummin)

Output:

   plant_sp skipped
   <chr>      <dbl>
 1 plant_1        0
 2 plant_1        1
 3 plant_1        2
 4 plant_2        0
 5 plant_2        1
 6 plant_2        1
 7 plant_2        0
 8 plant_2        2
 9 plant_3        0
10 plant_3        1
11 plant_3        2
12 plant_3        2
13 plant_4        0
14 plant_4        0
15 plant_4        1
16 plant_4        1
17 plant_4        2

Upvotes: 1

Carey Caginalp
Carey Caginalp

Reputation: 432

As I understand it, you want to look at the row before, and see if there are any changes, and continue throughout the dataframe. I believe this can be done by looking at a lag and then filtering down to rows where the lagged value and current value differ. Assuming your dataframe is sorted as in your test, this should work:

test_df %>% 
  mutate(lag1 = coalesce(lag(plant_sp, 1), plant_sp),
         lag2 = coalesce(lag(skipped, 1), skipped)) %>% 
  filter(plant_sp == lag1 & lag2 != skipped) %>% 
  select(plant_sp, skipped)

Upvotes: 0

Related Questions