Evaluation error of identity in lubridate::interval objects

Question

Assume a df like this:

df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))

   id      time1      time2
1   1 2008-10-12 2009-03-20
2   1 2008-08-10 2009-06-15
3   2 2006-01-09 2006-02-13
4   2 2008-03-13 2008-04-17
5   3 2008-09-12 2008-10-17
6   3 2007-05-30 2007-07-04
7   4 2003-09-29 2004-01-15
8   4 2003-09-29 2004-01-15
9   5 2003-04-01 2003-07-04
10  5 2003-04-01 2003-07-04

What I am trying to do is, first, create a lubridate interval from the variables "time1" and "time2". Second, I want to group by "id" and check whether the next row's interval is the same as the current row's, and whether the previous row's interval is the same as the current row's. I can achieve it with:

library(tidyverse)
library(lubridate)  # provides interval(); loaded explicitly in case tidyverse does not attach it

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = interval(time1, time2)) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0))

      id time1      time2      overlap                        cond1 cond2
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     1    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     1
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     1    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     1
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

The problem, as you can see, is that for id == 2 and id == 3 both conditions come out as 1 (TRUE), even though the intervals are not the same. For id == 1 they correctly come out as 0 (FALSE), and for id == 4 and id == 5 they correctly come out as 1 (TRUE).

Now, when I convert the interval to character, the comparison gives the expected result:

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = as.character(interval(time1, time2))) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0)) 

      id time1      time2      overlap                        cond1 cond2
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     0    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     0
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     0    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     0
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

The question is: why are some intervals evaluated as identical when they are not?

Wil · Accepted Answer

UPDATE

If you look at the code for the Interval class, you will see that when the object is created it stores the start date, calculates the difference between end and start in seconds, and stores that difference as .Data.

interval <- function(start, end = NULL, tzone = tz(start)) {

  if (is.null(tzone)) {
    tzone <- tz(end)
    if (is.null(tzone))
      tzone <- "UTC"
  }

  if (is.character(start) && is.null(end)) {
    return(parse_interval(start, tzone))
  }

  if (is.Date(start)) start <- date_to_posix(start)
  if (is.Date(end)) end <- date_to_posix(end)

  start <- as_POSIXct(start, tzone)
  end <- as_POSIXct(end, tzone)

  span <- as.numeric(end) - as.numeric(start)
  starts <- start + rep(0, length(span))
  if (tzone != tz(starts)) starts <- with_tz(starts, tzone)

  new("Interval", span, start = starts, tzone = tzone)
}

In other words, the returned object has no concept of the "end date". The default value for the end argument is NULL, meaning you can even create an interval without an end date.

interval("2019-03-29")
[1] 2019-03-29 UTC--NA
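
A small sketch of what that means in practice, using two of the date pairs from the question (rows 3 and 4; the names a and b are only for illustration). The two intervals start on different dates, but the span stored in .Data is the same:

```r
library(lubridate)

# Two intervals from the question (rows 3 and 4, id == 2).
# The names a and b are illustrative only.
a <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))
b <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))

a@start   # stored start instant of a
b@start   # a different start instant
a@.Data   # stored span of a, in seconds
b@.Data   # stored span of b, in seconds

# The numeric parts match (both spans are 35 days) although the starts differ
a@.Data == b@.Data
a@start == b@start
```

Anything that looks only at the numeric part of these objects cannot tell a and b apart.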

The "end date" is simply text generated from a calculation that occurs when the Interval object is formatted for printing.

format.Interval <- function(x, ...) {
  if (length(x@.Data) == 0) return("Interval(0)")
  paste(format(x@start, tz = x@tzone, usetz = TRUE), "--",
        format(x@start + x@.Data, tz = x@tzone, usetz = TRUE), sep = "")
}

int_end <- function(int) int@start + int@.Data

Both of those code snippets are taken from https://github.com/tidyverse/lubridate/blob/f7a7c2782ba91b821f9af04a40d93fbf9820c388/R/intervals.r.
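
As a rough check of that claim, the end can be rebuilt by hand from the two stored pieces and compared with what int_end() returns. The variable names here are illustrative, and the dates are taken from id == 4 in the question:

```r
library(lubridate)

iv <- interval(as.Date("2003-09-29"), as.Date("2004-01-15"))

# The end is not stored; rebuild it from the start and the span in seconds
recomputed_end <- iv@start + iv@.Data

# int_end() is the exported accessor and should agree with the manual sum
int_end(iv) == recomputed_end

# Printing the rebuilt end in UTC gives back the original end date
format(recomputed_end, "%Y-%m-%d", tz = "UTC")
```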

Accessing the underlying slots of overlap lets you do the comparison without converting to character. You have to check that both start and .Data are equal. Converting to character is much cleaner, but if you want to avoid it, this is how you could do it.

ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0)

Taken altogether:

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2),
         overlap_char = as.character(interval(time1, time2))) %>%
  group_by(id) %>%
  mutate(cond1_original = ifelse(lead(overlap_char) == overlap_char, 1, 0),
         cond1_new = ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0),
         cond2_original = ifelse(lag(overlap_char) == overlap_char, 1, 0),
         cond2_new = ifelse(lag(overlap@start) == overlap@start & lag(overlap@.Data) == overlap@.Data, 1, 0)) 

     id time1      time2      overlap                        overlap_char                   cond1_original cond1_new cond2_original cond2_new
1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 2008-10-12 UTC--2009-03-20 UTC              0         0             NA        NA
2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC 2008-08-10 UTC--2009-06-15 UTC             NA        NA              0         0
3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 2006-01-09 UTC--2006-02-13 UTC              0         0             NA        NA
4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC 2008-03-13 UTC--2008-04-17 UTC             NA        NA              0         0
5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 2008-09-12 UTC--2008-10-17 UTC              0         0             NA        NA
6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC 2007-05-30 UTC--2007-07-04 UTC             NA        NA              0         0
7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC              1         1             NA        NA
8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC             NA        NA              1         1
9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC              1         1             NA        NA
10    5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC             NA        NA              1         1
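
If you would rather not reach into the slots directly, the same check can be sketched with the exported accessors int_start() and int_end(). The helper same_interval() below is hypothetical (it is not part of lubridate) and is only meant to show the idea:

```r
library(lubridate)

# Hypothetical helper (not part of lubridate): two intervals are treated as
# the same only if both their start and their end instants match.
same_interval <- function(x, y) {
  int_start(x) == int_start(y) & int_end(x) == int_end(y)
}

a <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))  # 35 days
b <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))  # 35 days, later start
d <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))  # same dates as a

same_interval(a, b)  # same length, different start
same_interval(a, d)  # same start and same end
```

In the grouped mutate() the same idea could be applied to lead(int_start(overlap)) and lead(int_end(overlap)), since lead() and lag() work on the POSIXct vectors those accessors return.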

You can read more about Intervals here: https://lubridate.tidyverse.org/reference/Interval-class.html

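I believe your exact case has to do with the == comparison. "overlap" is not a plain atomic vector: it is an S4 Interval object whose numeric data part (.Data) holds the length of the interval in seconds, with the start stored separately. From ?==, it says: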

At least one of x and y must be an atomic vector, but if the other is a list R attempts to coerce it to the type of the atomic vector: this will succeed if the list is made up of elements of length one that can be coerced to the correct type.

If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.

We can coerce "overlap" to both numeric and character to see the difference.

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2)) %>%
  group_by(id) %>%
  mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
         cond2 = ifelse(lag(overlap) == overlap, 1, 0)) %>%
  mutate(overlap.n = as.numeric(overlap),
         overlap.c = as.character(overlap))

# A tibble: 10 x 8
# Groups:   id [5]
       id time1      time2      overlap                        cond1 cond2 overlap.n overlap.c
  1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA  13737600 2008-10-12 U…
  2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0  26697600 2008-08-10 U…
  3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     1    NA   3024000 2006-01-09 U…
  4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     1   3024000 2008-03-13 U…
  5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     1    NA   3024000 2008-09-12 U…
  6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     1   3024000 2007-05-30 U…
  7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA   9331200 2003-09-29 U…
  8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1   9331200 2003-09-29 U…
  9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA   8121600 2003-04-01 U…
 10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1   8121600 2003-04-01 U…

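Per the output above, I believe that == ends up comparing "overlap" as a numeric vector, i.e. only the interval lengths in seconds, which is the duration comparison @hmhensen mentions. For id == 2 and id == 3, both rows have the same length (3024000 seconds, i.e. 35 days), so they compare as equal even though their start dates differ. When you force the coercion to character rather than numeric, you get your desired result.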
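
The difference between the two coercions can be seen on a single pair of intervals. This is a minimal sketch using the two rows of id == 3 from the question; the names a and b are illustrative:

```r
library(lubridate)

a <- interval(as.Date("2008-09-12"), as.Date("2008-10-17"))
b <- interval(as.Date("2007-05-30"), as.Date("2007-07-04"))

# Numeric coercion keeps only the span in seconds
as.numeric(a)
as.numeric(b)
as.numeric(a) == as.numeric(b)

# Character coercion keeps the start and the computed end
as.character(a)
as.character(b)
as.character(a) == as.character(b)
```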
