Reputation: 59
I’m working with a dataset of 191K observations that has individual records with date stamps throughout the period of 2014-2020.
I have four reference dates (19 Sept 2014, 9 Sept 2016, 26 Oct 2017, 19 June 2019) that I need to determine the difference between each record’s date stamp. The rub is that I only need positive values for difftime
:
Sample data:
Date Difftime Notes:
11 Nov 2014 53 19 Sept 2014 used as reference
10 Jun 2015 180 19 Sept 2014 reference, but >180
5 Jan 2018 71 26 Oct 2017 reference
1 May 2019 NA No reference date within 365 days
In summary, a record’s date stamp needs to be compared to a relevant reference date (i.e., the closet, post-record date). I think I can do this in a number of individual ifelse
statements after creating separate variables for each of the reference dates, but I don’t want to clutter my dataset with more variables (even if I can remove them post-assessment). I’d appreciate any insights into how to code for this assessment.
Upvotes: 1
Views: 142
Reputation: 30474
Here is one possible approach.
I created a vector with the four reference dates (in order):
ref_dates <- as.Date(c("2014-09-19", "2016-09-09", "2017-10-26", "2019-6-19"))
and a data.frame with sample data:
sample_data <- data.frame(Date = as.Date(c("2014-11-11", "2015-6-10", "2018-1-5", "2019-5-1")))
You can use findInterval
to identify the closest reference date for each sample date, without going over ("price-is-right" approach):
Edit: I created a function to return the closest date (or NA
if it precedes all of the reference dates):
my_fun <- function(x) {
the_date <- ref_dates[findInterval(x, ref_dates)]
return(ifelse(length(the_date) == 0, NA, the_date))
}
The function would be called as follows, including origin
as date becomes numeric:
sample_data$Ref_Date <- as.Date(sapply(sample_data$Date, my_fun), origin = '1970-01-01')
That would give you this:
Date Ref_Date
1 2014-11-11 2014-09-19
2 2015-06-10 2014-09-19
3 2018-01-05 2017-10-26
4 2019-05-01 2017-10-26
You can calculate difference between the two dates (in this case stored as integer):
sample_data$Difftime <- as.integer(round(difftime(sample_data$Date, sample_data$Ref_Date, units = "days")))
Then, you can apply your rules. Options could include case_when
(from dplyr
) or newer fcase
from latest data.table
(it is fast).
library(data.table)
setDT(sample_data)
sample_data[ , Difftime_final := fcase(
Difftime < 180L, Difftime,
Difftime < 365L, 180L,
Difftime >= 365L, NA_integer_
)]
Which gives the final result:
Date Ref_Date Difftime Difftime_final
1: 2014-11-11 2014-09-19 53 53
2: 2015-06-10 2014-09-19 264 180
3: 2018-01-05 2017-10-26 71 71
4: 2019-05-01 2017-10-26 552 NA
Upvotes: 1