Determine differences between date stamps and reference dates

Question

I’m working with a dataset of 191K observations that has individual records with date stamps throughout the period of 2014-2020.

I have four reference dates (19 Sept 2014, 9 Sept 2016, 26 Oct 2017, 19 June 2019) that I need to determine the difference between each record’s date stamp. The rub is that I only need positive values for difftime:

if the difference is <180 then the difference will be used;
if the difference between dates is 180-365 days, then the value will be set to 180;
if the difference >365 days, then NA;
no negative values will be included.

Sample data:

Date            Difftime     Notes:
11 Nov 2014     53           19 Sept 2014 used as reference
10 Jun 2015     180          19 Sept 2014 reference, but >180
5 Jan 2018      71           26 Oct 2017 reference
1 May 2019      NA           No reference date within 365 days

In summary, a record’s date stamp needs to be compared to a relevant reference date (i.e., the closet, post-record date). I think I can do this in a number of individual ifelse statements after creating separate variables for each of the reference dates, but I don’t want to clutter my dataset with more variables (even if I can remove them post-assessment). I’d appreciate any insights into how to code for this assessment.

Ben · Accepted Answer

Here is one possible approach.

I created a vector with the four reference dates (in order):

ref_dates <- as.Date(c("2014-09-19", "2016-09-09", "2017-10-26", "2019-6-19"))

and a data.frame with sample data:

sample_data <- data.frame(Date = as.Date(c("2014-11-11", "2015-6-10", "2018-1-5", "2019-5-1")))

You can use findInterval to identify the closest reference date for each sample date, without going over ("price-is-right" approach):

Edit: I created a function to return the closest date (or NA if it precedes all of the reference dates):

my_fun <- function(x) {
  the_date <- ref_dates[findInterval(x, ref_dates)]
  return(ifelse(length(the_date) == 0, NA, the_date))
}

The function would be called as follows, including origin as date becomes numeric:

sample_data$Ref_Date <- as.Date(sapply(sample_data$Date, my_fun), origin = '1970-01-01')

That would give you this:

        Date   Ref_Date
1 2014-11-11 2014-09-19
2 2015-06-10 2014-09-19
3 2018-01-05 2017-10-26
4 2019-05-01 2017-10-26

You can calculate difference between the two dates (in this case stored as integer):

sample_data$Difftime <- as.integer(round(difftime(sample_data$Date, sample_data$Ref_Date, units = "days")))

Then, you can apply your rules. Options could include case_when (from dplyr) or newer fcase from latest data.table (it is fast).

library(data.table)

setDT(sample_data)

sample_data[ , Difftime_final := fcase(
  Difftime < 180L, Difftime,
  Difftime < 365L, 180L,
  Difftime >= 365L, NA_integer_
)]

Which gives the final result:

         Date   Ref_Date Difftime Difftime_final
1: 2014-11-11 2014-09-19       53             53
2: 2015-06-10 2014-09-19      264            180
3: 2018-01-05 2017-10-26       71             71
4: 2019-05-01 2017-10-26      552             NA

Determine differences between date stamps and reference dates

Answers (1)

Related Questions