Lili
Lili

Reputation: 587

Eliminate duplicates based on conditions from several columns in R

This is my dataset:

df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"), date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
                 infarct1 = c(TRUE, NA, TRUE, NA, NA, TRUE),infarct2 = c(TRUE, TRUE, TRUE, TRUE, NA, TRUE,  stringsAsFactors = F)

Basically I need to keep just 1 patient ID (aka, eliminate duplicated PatientID), based on the most recent infarct (last infarct==TRUE [but any kind of infarct] based on date).

So the outcome I want would look like:

 df <- data.frame(PatientID = c("3454","345"), date = c("29/03/2004", "05/2/2021"),
                     infarct = c(TRUE,TRUE), stringsAsFactors = F)

Hope this makes sense.

Thanks

Upvotes: 0

Views: 53

Answers (2)

Marcos P&#233;rez
Marcos P&#233;rez

Reputation: 1250

Try this:

library(dplyr)

df <- df %>% 
  mutate(infarct = infarct1 | infarct2) %>%
  filter(infarct == TRUE) %>%
  group_by(PatientID, infarct) %>%
  summarise(date=max(date))
  1. Create infarct variable.
  2. Filter TRUE infarct.
  3. Group.
  4. Look for last time.

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389055

You can turn the date to date class, arrange the data by PatientID and date and get the last date where infarct = TRUE.

library(dplyr)

df %>%
  mutate(date = lubridate::dmy(date)) %>%
  arrange(PatientID, date) %>%
  group_by(PatientID) %>%
  summarise(date = date[max(which(infarct))], 
            infract = TRUE)

#  PatientID date       infract
#  <chr>     <date>     <lgl>  
#1 345       2003-03-29 TRUE   
#2 3454      2004-03-29 TRUE   

For multiple columns get the data in long format.

df %>%
  mutate(date = lubridate::dmy(date)) %>%
  tidyr::pivot_longer(cols = starts_with('infarct')) %>%
  arrange(PatientID, date) %>%
  group_by(PatientID) %>%
  slice(max(which(value))) %>%
  ungroup

#  PatientID date       name     value
#  <chr>     <date>     <chr>    <lgl>
#1 345       2021-02-05 infarct2 TRUE 
#2 3454      2004-03-29 infarct2 TRUE 

data

I think you need quotes around data in date column.

df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"), 
                 date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
                 infarct = c(TRUE, NA, TRUE, NA, NA, TRUE), stringsAsFactors = FALSE)

Upvotes: 1

Related Questions