Zelbinian
Zelbinian

Reputation: 3417

Sometimes the %within% function in lubridate does not return the expected results - what am I doing wrong?

I have a data frame/tibble relating to projects that includes datetime data. There are two datetime columns, one for when the project started - Launch Date - and one for when the project is due to end - End Date.

For whatever reason, the %within% function operates as expected when I'm filtering the End Date column but not the Launch Date column and I cannot figure out why.

I'll take you through my investigation so far step by step. All of this is copied directly from code/consoles so if there are typos or mistakes pointing them out is legit. (However, for this discussion, stylistic changes are not helpful.) Also, because we're dealing with dates, all of the example code below assumes that the refence date of "today" is the day I'm posting this, March 12, 2021.

First, I create the intervals I'm looking for (projects ending within a week, starting tomorrow; projects that began within the last week not including today):

endInterval <- interval(today() + days(1), today() + weeks(1))
newInterval <- interval(today() - weeks(1), today() - days(1))

Next I test that the intervals contain the expected range of dates:

> int_end(endInterval)
[1] "2021-03-19 UTC"
> int_start(endInterval)
[1] "2021-03-13 UTC"
> int_end(newInterval)
[1] "2021-03-11 UTC"
> int_start(newInterval)
[1] "2021-03-05 UTC"

And that the %within% function works with the intervals as expected:

> int_end(endInterval) %within% endInterval
[1] TRUE
> int_start(endInterval) %within% endInterval
[1] TRUE
> int_end(newInterval) %within% newInterval
[1] TRUE
> int_start(newInterval) %within% newInterval
[1] TRUE

So far, so tautological. Now I'm going to filter the main data frame/tibble (simply called data) using these intervals and the %within% function. First I'll filter using the endInterval I've set up and show that it's working as expected by saving the results to new data frame and then looking at the unique dates in the End Date column:

> endData <- data %>% filter(`End Date` %within% endInterval) %>% arrange(`End Date`)
> unique(endData$`End Date`)
 [1] "2021-03-13 03:00:00 UTC" "2021-03-13 03:24:00 UTC" "2021-03-13 07:00:00 UTC" "2021-03-13 12:00:00 UTC" "2021-03-13 15:34:00 UTC" "2021-03-14 16:39:00 UTC"
 [7] "2021-03-15 09:50:00 UTC" "2021-03-15 13:02:00 UTC" "2021-03-15 14:01:00 UTC" "2021-03-15 17:11:00 UTC" "2021-03-15 18:00:00 UTC" "2021-03-15 20:00:00 UTC"
[13] "2021-03-16 00:00:00 UTC" "2021-03-16 12:46:00 UTC" "2021-03-16 13:59:00 UTC" "2021-03-16 14:59:00 UTC" "2021-03-16 17:30:00 UTC" "2021-03-16 18:45:00 UTC"
[19] "2021-03-16 19:00:00 UTC" "2021-03-16 20:00:00 UTC" "2021-03-16 22:59:00 UTC" "2021-03-17 02:00:00 UTC" "2021-03-17 03:00:00 UTC" "2021-03-17 06:35:00 UTC"
[25] "2021-03-17 06:59:00 UTC" "2021-03-17 10:56:00 UTC" "2021-03-17 14:58:00 UTC" "2021-03-17 16:00:00 UTC" "2021-03-17 23:00:00 UTC" "2021-03-18 00:00:00 UTC"
[31] "2021-03-18 03:59:00 UTC" "2021-03-18 14:53:00 UTC" "2021-03-18 17:12:00 UTC" "2021-03-18 20:06:00 UTC" "2021-03-19 00:00:00 UTC"

This demonstrates that the data being returned does include data that is inclusive of the endpoints of the interval, as expected.

The problem comes when I do the exact same thing using the newInterval with the Launch Date column.

> newData <- data %>% filter(`Launch Date` %within% newInterval) %>% arrange(`Launch Date`)
> unique(newData$`Launch Date`)
 [1] "2021-03-05 15:00:00 UTC" "2021-03-05 15:04:00 UTC" "2021-03-05 18:11:00 UTC" "2021-03-05 22:07:00 UTC" "2021-03-06 15:00:00 UTC" "2021-03-06 17:03:00 UTC"
 [7] "2021-03-08 06:20:00 UTC" "2021-03-08 08:07:00 UTC" "2021-03-08 11:16:00 UTC" "2021-03-08 12:00:00 UTC" "2021-03-08 14:03:00 UTC" "2021-03-08 16:11:00 UTC"
[13] "2021-03-08 16:59:00 UTC" "2021-03-08 17:59:00 UTC" "2021-03-08 18:03:00 UTC" "2021-03-08 18:12:00 UTC" "2021-03-08 19:37:00 UTC" "2021-03-09 05:01:00 UTC"
[19] "2021-03-09 08:00:00 UTC" "2021-03-09 08:38:00 UTC" "2021-03-09 10:42:00 UTC" "2021-03-09 12:12:00 UTC" "2021-03-09 13:00:00 UTC" "2021-03-09 14:00:00 UTC"
[25] "2021-03-09 14:37:00 UTC" "2021-03-09 16:00:00 UTC" "2021-03-09 16:11:00 UTC" "2021-03-09 16:30:00 UTC" "2021-03-09 16:59:00 UTC" "2021-03-09 17:03:00 UTC"
[31] "2021-03-09 17:07:00 UTC" "2021-03-09 17:59:00 UTC" "2021-03-09 18:00:00 UTC" "2021-03-09 19:17:00 UTC" "2021-03-09 19:52:00 UTC" "2021-03-09 20:03:00 UTC"
[37] "2021-03-10 01:34:00 UTC" "2021-03-10 02:40:00 UTC" "2021-03-10 02:45:00 UTC" "2021-03-10 13:25:00 UTC" "2021-03-10 14:43:00 UTC" "2021-03-10 15:56:00 UTC"
[43] "2021-03-10 16:48:00 UTC"

The data returned includes datetimes that are inclusive of the beginning of the interval, but not the end. And yes, there is data that has Launch Dates on March 11, 2021. (This is truncated to prevent this from being too long.)

> unique(data$`Launch Date`)
  [1] "2021-03-01 18:42:00 UTC" "2021-02-06 00:27:00 UTC" "2021-02-16 15:53:00 UTC" "2021-03-01 16:59:00 UTC" "2021-03-06 17:03:00 UTC"
  [6] "2021-02-06 01:43:00 UTC" "2021-03-01 05:01:00 UTC" "2021-03-02 17:59:00 UTC" "2021-02-01 07:18:00 UTC" "2021-02-12 17:39:00 UTC"
....
[171] "2021-03-11 16:04:00 UTC" "2021-03-11 17:56:00 UTC" "2021-03-11 21:08:00 UTC" "2021-03-11 02:46:00 UTC" "2021-03-11 14:00:00 UTC"
[176] "2021-03-11 16:00:00 UTC" "2021-03-11 17:42:00 UTC" "2021-03-11 16:36:00 UTC" "2021-03-11 18:53:00 UTC" "2021-03-11 04:29:00 UTC"

These ought to be behaving identically, but they do not and I have no idea why. Hopefully one of you does.

Upvotes: 0

Views: 438

Answers (1)

Vons
Vons

Reputation: 3335

Well, it's simply because those datetimes aren't in the newInterval. The last time that is in newInterval is 2021-03-11 00:00:00 UTC. Which is to say, there is no exact time that is 2021-03-11 00:00:00 UTC in your LaunchDate leading you to suspect that any times in that day should be there, but observe that the only timestamp that is on the end date of the endInterval is at 00:00:00.

as_datetime("2021-03-10 23:59:59 UTC") %within% newInterval
TRUE

as_datetime("2021-03-11 00:00:00 UTC") %within% newInterval
TRUE

as_datetime("2021-03-11 00:00:01 UTC") %within% newInterval
FALSE

as_datetime("2021-03-11 16:04:00 UTC") %within% newInterval
FALSE

You can adjust the newInterval to go to the end of the day.

newInterval=interval(as_datetime("2021-03-05 UTC 00:00:00"), as_datetime("2021-03-11 23:59:59 UTC"))

as_datetime("2021-03-11 00:00:01 UTC") %within% newInterval
TRUE

as_datetime("2021-03-11 16:04:00 UTC") %within% newInterval
TRUE

Upvotes: 1

Related Questions