Reputation: 587
This is my dataset:
df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"), date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
infarct1 = c(TRUE, NA, TRUE, NA, NA, TRUE),infarct2 = c(TRUE, TRUE, TRUE, TRUE, NA, TRUE, stringsAsFactors = F)
Basically I need to keep just 1 patient ID (aka, eliminate duplicated PatientID
), based on the most recent infarct (last infarct==TRUE
[but any kind of infarct] based on date
).
So the outcome I want would look like:
df <- data.frame(PatientID = c("3454","345"), date = c("29/03/2004", "05/2/2021"),
infarct = c(TRUE,TRUE), stringsAsFactors = F)
Hope this makes sense.
Thanks
Upvotes: 0
Views: 53
Reputation: 1250
Try this:
library(dplyr)
df <- df %>%
mutate(infarct = infarct1 | infarct2) %>%
filter(infarct == TRUE) %>%
group_by(PatientID, infarct) %>%
summarise(date=max(date))
infarct
variable.Upvotes: 2
Reputation: 389055
You can turn the date to date class, arrange
the data by PatientID
and date
and get the last date where infarct = TRUE
.
library(dplyr)
df %>%
mutate(date = lubridate::dmy(date)) %>%
arrange(PatientID, date) %>%
group_by(PatientID) %>%
summarise(date = date[max(which(infarct))],
infract = TRUE)
# PatientID date infract
# <chr> <date> <lgl>
#1 345 2003-03-29 TRUE
#2 3454 2004-03-29 TRUE
For multiple columns get the data in long format.
df %>%
mutate(date = lubridate::dmy(date)) %>%
tidyr::pivot_longer(cols = starts_with('infarct')) %>%
arrange(PatientID, date) %>%
group_by(PatientID) %>%
slice(max(which(value))) %>%
ungroup
# PatientID date name value
# <chr> <date> <chr> <lgl>
#1 345 2021-02-05 infarct2 TRUE
#2 3454 2004-03-29 infarct2 TRUE
data
I think you need quotes around data in date
column.
df <- data.frame(PatientID = c("3454","3454","3454","345","345","345"),
date = c("05/01/2001", "02/06/1997", "29/03/2004", "05/2/2021", "01/06/1960", "29/03/2003"),
infarct = c(TRUE, NA, TRUE, NA, NA, TRUE), stringsAsFactors = FALSE)
Upvotes: 1