Reputation: 423
I have a data frame as shown below
Id Date1 Date2 QuestionId AnswerValue
10 2000-01-14 2000-01-14 1339 3
10 2000-01-14 1999-12-09 1339 2
10 2000-01-14 1999-11-23 1461 1
10 2000-01-14 2000-01-03 1461 18
10 2000-01-14 1999-11-16 274 0
57 2014-02-01 2014-12-10 278 0
57 2014-02-01 2012-11-07 280 0
57 2014-02-01 2012-09-30 280 0
57 2014-02-01 2012-01-15 261 0
My goal is to retain observations based the following criteria.
Id Date1 Date2 QuestionId AnswerValue
57 2014-02-01 2012-01-15 261 0
Id Date1 Date2 QuestionId AnswerValue
10 2000-01-14 2000-01-14 1339 3
Remove any rows where Date2 value is greater than Date1 For example 6th row should be deleted
Id Date1 Date2 QuestionId AnswerValue 57 2014-02-01 2014-12-10 278 0
The final dataset would look like this below.
Id Date1 Date2 QuestionId AnswerValue
10 2000-01-14 2000-01-14 1339 3
10 2000-01-14 2000-01-03 1461 18
10 2000-01-14 1999-11-16 274 0
57 2014-02-01 2014-12-10 278 0
57 2014-02-01 2012-11-07 280 0
57 2014-02-01 2012-01-15 261 0
Any help on achieving this is much appreciated. Thanks in advance.
Upvotes: 3
Views: 126
Reputation: 1475
Using data.table
, first filter on your criterium 3 (Date2 <= Date1
), then only return the row where Date1 - Date2
is smallest (i.e. the dates are closest to each other), and use by = .(Id, QuestionId)
to do this for each unique combination of Id
and QuestionId
:
library(data.table)
dt <- structure(list(Id = c(10L, 10L, 10L, 10L, 10L, 57L, 57L, 57L,
57L), Date1 = structure(c(10970, 10970, 10970, 10970, 10970,
16102, 16102, 16102, 16102), class = "Date"), Date2 = structure(c(10970,
10934, 10918, 10959, 10911, 16414, 15651, 15613, 15354), class = "Date"),
QuestionId = c(1339L, 1339L, 1461L, 1461L, 274L, 278L, 280L,
280L, 261L), AnswerValue = c(3L, 2L, 1L, 18L, 0L, 0L, 0L,
0L, 0L)), .Names = c("Id", "Date1", "Date2", "QuestionId",
"AnswerValue"), row.names = c(NA, -9L), class = "data.frame")
setDT(dt)
dt[Date2 <= Date1, .SD[which.min(Date1-Date2)], by = .(Id, QuestionId)]
Id QuestionId Date1 Date2 AnswerValue
1: 10 1339 2000-01-14 2000-01-14 3
2: 10 1461 2000-01-14 2000-01-03 18
3: 10 274 2000-01-14 1999-11-16 0
4: 57 280 2014-02-01 2012-11-07 0
5: 57 261 2014-02-01 2012-01-15 0
Please note that in your final data example this row does not meet your third criteria (Date2 <= Date1
):
Id Date1 Date2 QuestionId AnswerValue
57 2014-02-01 2014-12-10 278 0
Upvotes: 1
Reputation: 51592
Here is an idea using dplyr
,
library(dplyr)
df %>%
group_by(Id, QuestionId) %>%
slice(which.min(difftime(Date1, Date2))) %>%
filter(Date2 <= Date1)
#Source: local data frame [5 x 5]
#Groups: Id, QuestionId [5]
# Id Date1 Date2 QuestionId AnswerValue
# <int> <date> <date> <int> <int>
#1 10 2000-01-14 1999-11-16 274 0
#2 10 2000-01-14 2000-01-14 1339 3
#3 10 2000-01-14 2000-01-03 1461 18
#4 57 2014-02-01 2012-01-15 261 0
#5 57 2014-02-01 2012-11-07 280 0
NOTE
Make sure your Date columns are set as.Date
Upvotes: 4