VincentP
VincentP

Reputation: 89

Remove rows based on condition and date difference between different events in R with dplyr

I expect to remove rows based on several condition and dates differences between different events, per ids.

my data look like this

#dat
id  name                    date
10  "BREEDING"              2019-05-17
10  "OTHER"                 2020-01-01
11  "BREEDING"              2020-07-01
11  "GESTATION POSITIF"     2020-09-01
12  "BREEDING"              2020-06-26
12  "GESTATION NEGATIF"     2020-08-01
21  "OTHER"                 2018-06-20
21  "GESTATION POSITIF"     2018-10-15
22  "BREEDING"              2020-08-07
22  "GESTATION POSITIF"     2020-09-11

what do I wish is, per ids, for those who got name "BREEDING" and "GESTATION" (doesn't matter if it is positive or negative), to compute the date difference between the events "GESTATION" and the events "BREEDING".

Then if the date difference is 34, 35, 36, to remove the row BREEDING who match this condition, within the object dat.

I got a loop solution which is doing the job, but I'd like to improve this. I made several trials but did not succeed. Basically I assume that the right code should be something "like" this

dat %>% 
group_by(id) %>% 
grepl("GESTATION",name) %>% #keep ids with an event gestation recorded
grepl("BREEDING",name) %>% #keep ids with an event breeding recorded
mutate(new_var = "date GESTATION"-"date BREEDING") %>%#per ids, compute de difference of date beetween events gestation event and breeding
filter(!(new_var %in% c(34:36))) %>% #if an event of breeding happened 35 days before the gestation event, +/- 1 day, remove 
ungroup()

finally my result should be, for this example,

#dat
id  name                    date
10  "BREEDING"              2019-05-17
10  "OTHER"                 2020-01-01
11  "BREEDING"              2020-07-01
11  "GESTATION POSITIF"     2020-09-01
12  "GESTATION NEGATIF"     2020-08-01
21  "OTHER"                 2018-06-20
21  "GESTATION POSITIF"     2018-10-15
22  "GESTATION POSITIF"     2020-09-11

Upvotes: 2

Views: 197

Answers (1)

juljo
juljo

Reputation: 674

This is also probably not the most clean solution but pivoting to wide format and then back to long works:

library(tidyverse)
library(lubridate)

dat %>%
  separate(name, into = c("name", "gest"), fill = "right") %>%
  pivot_wider(names_from = name, values_from = c(date, gest)) %>%
  mutate(date_BREEDING = if_else((date_GESTATION - date_BREEDING) %in% c(34, 35, 36), NA_Date_, date_BREEDING)) %>%
  pivot_longer(cols = c(date_BREEDING, date_OTHER, date_GESTATION), values_to = "date", values_drop_na = T) %>%
  select(-gest_BREEDING, -gest_OTHER) %>%
  mutate(name = str_sub(name, 6)) 

The output is:

     id gest_GESTATION name      date      
  <dbl> <chr>          <chr>     <date>    
1    10 NA             BREEDING  2019-05-17
2    10 NA             OTHER     2020-01-01
3    11 POSITIF        BREEDING  2020-07-01
4    11 POSITIF        GESTATION 2020-09-01
5    12 NEGATIF        GESTATION 2020-08-01
6    21 POSITIF        OTHER     2018-06-20
7    21 POSITIF        GESTATION 2018-10-15
8    22 POSITIF        GESTATION 2020-09-11

Which has the additional advantage of saving whether "GESTATION" is positive or negative in a separate variable. If you do not need that and want exactly the desired output specified in your question you can add:

%>%
  mutate(name = if_else(is.na(gest_GESTATION), name, str_c(name, gest_GESTATION, sep = " "))) %>%
  select(-gest_GESTATION)

Upvotes: 1

Related Questions