Emily Fassbender
Emily Fassbender

Reputation: 423

Filter duplicates by date and id

I have a data frame as shown below

Id      Date1        Date2        QuestionId   AnswerValue
10      2000-01-14   2000-01-14   1339         3
10      2000-01-14   1999-12-09   1339         2
10      2000-01-14   1999-11-23   1461         1 
10      2000-01-14   2000-01-03   1461         18
10      2000-01-14   1999-11-16   274          0
57      2014-02-01   2014-12-10   278          0
57      2014-02-01   2012-11-07   280          0
57      2014-02-01   2012-09-30   280          0 
57      2014-02-01   2012-01-15   261          0 

My goal is to retain observations based the following criteria.

  1. Retain rows for that ID and QuestionID combination where QuestionID is unique and there are no other duplicate QuestionIDs for that ID+QuestionID combination. Example : Retain the last row
Id      Date1        Date2        QuestionId   AnswerValue
57      2014-02-01   2012-01-15   261          0 
  1. If there are duplicate QuestionIDs for each QuestionID+ID combination, retain only rows with QuestionID where Date2 column value is closest to Date1 column value, for example: Id 10 has two QuestionID 1339. Based on this criteria only the 1st row should be retained because Date2 value, 2000-01-14 for Question ID 1339 is closest to Date1 value 2000-01-14 compared to the second row where the Date2 value is 1999-12-09 compared to Date1 value 2000-01-14.
Id      Date1        Date2        QuestionId   AnswerValue
10      2000-01-14   2000-01-14   1339         3
  1. Remove any rows where Date2 value is greater than Date1 For example 6th row should be deleted

    Id Date1 Date2 QuestionId AnswerValue 57 2014-02-01 2014-12-10 278 0

The final dataset would look like this below.

Id      Date1        Date2        QuestionId   AnswerValue
10      2000-01-14   2000-01-14   1339         3
10      2000-01-14   2000-01-03   1461         18
10      2000-01-14   1999-11-16   274          0
57      2014-02-01   2014-12-10   278          0
57      2014-02-01   2012-11-07   280          0
57      2014-02-01   2012-01-15   261          0 

Any help on achieving this is much appreciated. Thanks in advance.

Upvotes: 3

Views: 126

Answers (2)

Using data.table, first filter on your criterium 3 (Date2 <= Date1), then only return the row where Date1 - Date2 is smallest (i.e. the dates are closest to each other), and use by = .(Id, QuestionId) to do this for each unique combination of Id and QuestionId:

library(data.table)
dt <- structure(list(Id = c(10L, 10L, 10L, 10L, 10L, 57L, 57L, 57L, 
57L), Date1 = structure(c(10970, 10970, 10970, 10970, 10970, 
16102, 16102, 16102, 16102), class = "Date"), Date2 = structure(c(10970, 
10934, 10918, 10959, 10911, 16414, 15651, 15613, 15354), class = "Date"), 
    QuestionId = c(1339L, 1339L, 1461L, 1461L, 274L, 278L, 280L, 
    280L, 261L), AnswerValue = c(3L, 2L, 1L, 18L, 0L, 0L, 0L, 
    0L, 0L)), .Names = c("Id", "Date1", "Date2", "QuestionId", 
"AnswerValue"), row.names = c(NA, -9L), class = "data.frame")

setDT(dt)
dt[Date2 <= Date1, .SD[which.min(Date1-Date2)], by = .(Id, QuestionId)]

   Id QuestionId      Date1      Date2 AnswerValue
1: 10       1339 2000-01-14 2000-01-14           3
2: 10       1461 2000-01-14 2000-01-03          18
3: 10        274 2000-01-14 1999-11-16           0
4: 57        280 2014-02-01 2012-11-07           0
5: 57        261 2014-02-01 2012-01-15           0

Please note that in your final data example this row does not meet your third criteria (Date2 <= Date1):

Id           Date1      Date2     QuestionId   AnswerValue
57      2014-02-01   2014-12-10   278          0

Upvotes: 1

Sotos
Sotos

Reputation: 51592

Here is an idea using dplyr,

library(dplyr)

df %>% 
  group_by(Id, QuestionId) %>% 
  slice(which.min(difftime(Date1, Date2))) %>% 
  filter(Date2 <= Date1)

#Source: local data frame [5 x 5]
#Groups: Id, QuestionId [5]

#     Id      Date1      Date2 QuestionId AnswerValue
#  <int>     <date>     <date>      <int>       <int>
#1    10 2000-01-14 1999-11-16        274           0
#2    10 2000-01-14 2000-01-14       1339           3
#3    10 2000-01-14 2000-01-03       1461          18
#4    57 2014-02-01 2012-01-15        261           0
#5    57 2014-02-01 2012-11-07        280           0

NOTE

Make sure your Date columns are set as.Date

Upvotes: 4

Related Questions