Lili
Lili

Reputation: 587

Eliminate duplicates based on 2 dates in a dataframe

I have this sample dataframe:

df <- data.frame(ID = c("5","5","5","5","5","5" ,"5"    ,"5","5","5","5","14","14","14","14" ,"14","14"),
                 Date1= c("22/07/2014","22/07/2014","22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                           ,"22/07/2014"
                          ,"22/07/2014"
                          ,"08/11/2016" 
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"
                         , "08/11/2016"),
                 Date2= c("01/01/2011"
                          ,"01/08/2011"
                          ,"01/12/2010"
                          ,"10/11/2015"
                          ,"22/07/2014"
                          ,"01/01/2013"
                          ,"23/04/2014"
                          ,"01/01/2006"
                          ,"01/01/2013"
                          ,"01/10/2012"
                          ,"01/08/2012"
                          ,"14/04/2015"
                          ,"01/10/2008"
                          ,"01/10/2008"
                          ,"14/05/2015"
                          ,"11/04/2015"
                          ,"05/10/2008"),
stringsAsFactors = F)

where I have each ID repeated several times. I need to get a dataframe with only 1 line per ID. As you can see, each patient has only one date in column df$date1, so the condition to select 1 row per patient would be: to pick the closest date between date 1 and date 2.

How could I do that?

Thansk

Upvotes: 2

Views: 59

Answers (3)

akrun
akrun

Reputation: 886998

Using base R, convert the 'Date' columns to Date class, order the data based on the 'ID' and the absolute difference between the Date columns, subset with duplicated i.e. first unique row on 'ID' column

df[2:3] <- lapply(df[2:3], as.Date, format = "%d/%m/%Y")
df1 <- df[with(df, order(ID, abs(as.numeric(Date1) - as.numeric(Date2)))),]
df1[!duplicated(df1$ID),]

-output

ID      Date1      Date2
15 14 2016-11-08 2015-05-14
5   5 2014-07-22 2014-07-22

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 101159

Try the following base R code

unique(
  subset(
    df,
    !!ave(
      abs(as.integer(as.Date(Date2, format = "%d/%m/%Y") - as.Date(Date1, format = "%d/%m/%Y"))),
      ID,
      FUN = function(x) x == min(x)
    )
  )
)

and you will get

   ID      Date1      Date2
5   5 22/07/2014 22/07/2014
15 14 08/11/2016 14/05/2015

Upvotes: 2

Vin&#237;cius F&#233;lix
Vin&#237;cius F&#233;lix

Reputation: 8811

Here a tidyverse approach. I created a column called diff_date, which is the absolute difference between Date1 and Date2. Than I filtered by each ID the minimum difference.

library(dplyr)
library(lubridate)
  
  df %>% 
  mutate(
    across(.cols = starts_with("Date"),.fns = dmy),
    diff_date = abs(as.numeric(difftime(Date1,Date2)))
    ) %>% 
  group_by(ID) %>% 
  filter(diff_date == min(diff_date))

# A tibble: 2 x 4
# Groups:   ID [2]
  ID    Date1      Date2      diff_date
  <chr> <date>     <date>         <dbl>
1 5     2014-07-22 2014-07-22         0
2 14    2016-11-08 2015-05-14  47001600

Upvotes: 3

Related Questions