Reputation: 39858
Assume a df like this:
df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))
id time1 time2
1 1 2008-10-12 2009-03-20
2 1 2008-08-10 2009-06-15
3 2 2006-01-09 2006-02-13
4 2 2008-03-13 2008-04-17
5 3 2008-09-12 2008-10-17
6 3 2007-05-30 2007-07-04
7 4 2003-09-29 2004-01-15
8 4 2003-09-29 2004-01-15
9 5 2003-04-01 2003-07-04
10 5 2003-04-01 2003-07-04
What I am trying to do is, first, create a lubridate interval between the variables "time1" and "time2". Second, I want to group by "id" and check whether the next row is the same as the current one and whether the current row is the same as the previous one. I can achieve it with:
library(tidyverse)
library(lubridate)
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
The problem is, as you may see, that for id == 2 and id == 3, both conditions are evaluated as TRUE, even though the intervals are not the same. For id == 1, it properly evaluates as FALSE, and for id == 4 and id == 5, it properly evaluates as TRUE.
Now, when I convert the interval into character, it evaluates it all right:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <chr> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 0 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 0 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
The question is, why does it evaluate some intervals as identical, when they are not?
Upvotes: 1
Views: 609
Reputation: 3178
UPDATE
If you look at the code for the Interval class, you will see that when the object is created it stores the start date, then calculates the difference between start and end and stores that as .Data.
interval <- function(start, end = NULL, tzone = tz(start)) {
if (is.null(tzone)) {
tzone <- tz(end)
if (is.null(tzone))
tzone <- "UTC"
}
if (is.character(start) && is.null(end)) {
return(parse_interval(start, tzone))
}
if (is.Date(start)) start <- date_to_posix(start)
if (is.Date(end)) end <- date_to_posix(end)
start <- as_POSIXct(start, tzone)
end <- as_POSIXct(end, tzone)
span <- as.numeric(end) - as.numeric(start)
starts <- start + rep(0, length(span))
if (tzone != tz(starts)) starts <- with_tz(starts, tzone)
new("Interval", span, start = starts, tzone = tzone)
}
In other words, the returned object has no concept of the "end date". The default value for the end argument is NULL, meaning you can even create an interval without an end date.
interval("2019-03-29")
[1] 2019-03-29 UTC--NA
The "end date" is simply text generated from a calculation that occurs when the Interval object is formatted for printing.
format.Interval <- function(x, ...) {
if (length(x@.Data) == 0) return("Interval(0)")
paste(format(x@start, tz = x@tzone, usetz = TRUE), "--",
format(x@start + x@.Data, tz = x@tzone, usetz = TRUE), sep = "")
}
int_end <- function(int) int@start + int@.Data
Both of those code snippets are taken from https://github.com/tidyverse/lubridate/blob/f7a7c2782ba91b821f9af04a40d93fbf9820c388/R/intervals.r.
Accessing the underlying slots of overlap allows you to complete the comparison without converting to character. You have to check that start and .Data are both equal. Converting to character is much cleaner, but if you want to avoid it, this is how you could do it.
ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0)
Taken altogether:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2),
overlap_char = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1_original = ifelse(lead(overlap_char) == overlap_char, 1, 0),
cond1_new = ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0),
cond2_original = ifelse(lag(overlap_char) == overlap_char, 1, 0),
cond2_new = ifelse(lag(overlap@start) == overlap@start & lag(overlap@.Data) == overlap@.Data, 1, 0))
id time1 time2 overlap overlap_char cond1_original cond1_new cond2_original cond2_new
<int> <date> <date> <S4: Interval> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 2008-10-12 UTC--2009-03-20 UTC 0 0 NA NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC 2008-08-10 UTC--2009-06-15 UTC NA NA 0 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 2006-01-09 UTC--2006-02-13 UTC 0 0 NA NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC 2008-03-13 UTC--2008-04-17 UTC NA NA 0 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 2008-09-12 UTC--2008-10-17 UTC 0 0 NA NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC 2007-05-30 UTC--2007-07-04 UTC NA NA 0 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 1 1 NA NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC NA NA 1 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC 1 1 NA NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC NA NA 1 1
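If you would rather not reach into the slots directly, the same check can be sketched with lubridate's exported accessors int_start() and int_end(). The helper name same_interval is my own and not part of lubridate, and the four intervals are just rows 3, 4, 9 and 10 of the example data:

```r
library(lubridate)

# Hypothetical helper (not part of lubridate): element-wise equality of two
# Interval vectors, requiring both the start and the end to match.
same_interval <- function(x, y) {
  int_start(x) == int_start(y) & int_end(x) == int_end(y)
}

# Rows 3, 4, 9 and 10 of the example data
a <- interval(
  as.Date(c("2006-01-09", "2008-03-13", "2003-04-01", "2003-04-01")),
  as.Date(c("2006-02-13", "2008-04-17", "2003-07-04", "2003-07-04"))
)

same_interval(a[1], a[2])  # same length, different dates
same_interval(a[3], a[4])  # same start and same end
```

The first call should give FALSE and the second TRUE, since the comparison no longer depends on the interval length alone.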
You can read more about Intervals here: https://lubridate.tidyverse.org/reference/Interval-class.html
I believe your exact case has to do with the == comparison. As you can see above, "overlap" is an S4 object (shown as <S4: Interval>), not a plain atomic vector. The help page ?`==` says:
At least one of x and y must be an atomic vector, but if the other is a list R attempts to coerce it to the type of the atomic vector: this will succeed if the list is made up of elements of length one that can be coerced to the correct type.
If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.
We can coerce "overlap" to both numeric and character to see the difference.
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0)) %>%
mutate(overlap.n = as.numeric(overlap),
overlap.c = as.character(overlap))
# A tibble: 10 x 8
# Groups: id [5]
id time1 time2 overlap cond1 cond2 overlap.n overlap.c
<int> <date> <date> <S4: Interval> <dbl> <dbl> <dbl> <chr>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA 13737600 2008-10-12 U…
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0 26697600 2008-08-10 U…
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA 3024000 2006-01-09 U…
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1 3024000 2008-03-13 U…
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA 3024000 2008-09-12 U…
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1 3024000 2007-05-30 U…
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA 9331200 2003-09-29 U…
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1 9331200 2003-09-29 U…
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA 8121600 2003-04-01 U…
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1 8121600 2003-04-01 U…
Per the output above, I believe that using == is coercing the "overlap" interval to a numeric vector, resulting in the duration comparison @hmhensen mentions in their answer. When you force the coercion to character rather than numeric, you get your desired result.
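To make the numeric view concrete, here is a small sketch using rows 3 and 4 of the example data. It relies on lubridate's int_length() and int_start(); the variable names x and y are mine:

```r
library(lubridate)

# Rows 3 and 4 of the example data: different dates, same length
x <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))
y <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))

# Length of each interval in seconds: 35 days * 86400 = 3024000
len_x <- int_length(x)
len_y <- int_length(y)

len_x == len_y                # TRUE: the numeric lengths match
int_start(x) == int_start(y)  # FALSE: the start dates differ
```

So any comparison that only sees the numeric part of the object cannot tell these two intervals apart.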
Upvotes: 2
Reputation: 3195
I think it has to do with what lubridate is actually calculating.
When I calculate the differences between time1 and time2, this happens:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = time2 - time1)
id time1 time2 overlap
1 1 2008-10-12 2009-03-20 159 days
2 1 2008-08-10 2009-06-15 309 days
3 2 2006-01-09 2006-02-13 35 days
4 2 2008-03-13 2008-04-17 35 days
5 3 2008-09-12 2008-10-17 35 days
6 3 2007-05-30 2007-07-04 35 days
7 4 2003-09-29 2004-01-15 108 days
8 4 2003-09-29 2004-01-15 108 days
9 5 2003-04-01 2003-07-04 94 days
10 5 2003-04-01 2003-07-04 94 days
So we can tell that within id 2 and id 3 the two intervals have the same length in days (35), even though their dates differ.
Now, what is overlap actually calculating? To find out, I changed your code slightly to report the lead and lag instead of 1.
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, lead(overlap), 0),
cond2 = ifelse(lag(overlap) == overlap, lag(overlap), 0))
# A tibble: 10 x 6
# Groups: id [5]
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 3024000 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 3024000
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 3024000 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 3024000
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 9331200 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 9331200
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 8121600 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 8121600
Here, we see that the comparisons with lead and lag actually work on the length of each interval in seconds rather than on the actual interval start and end dates. That appears to be why it sees certain intervals as equal while the character strings are unequal, as they ought to be.
Let's take a look at the object produced by interval.
a <- interval(df$time1, df$time2)
str(a)
#Formal class 'Interval' [package "lubridate"] with 3 slots
#..@ .Data: num [1:10] 13737600 26697600 3024000 3024000 3024000 ...
#..@ start: POSIXct[1:10], format: "2008-10-12" "2008-08-10" "2006-01-09" ...
#..@ tzone: chr "UTC"
It's an S4 class with three slots: .Data, start and tzone.
Printing a shows us the intervals.
a
[1] 2008-10-12 UTC--2009-03-20 UTC 2008-08-10 UTC--2009-06-15 UTC 2006-01-09 UTC--2006-02-13 UTC
[4] 2008-03-13 UTC--2008-04-17 UTC 2008-09-12 UTC--2008-10-17 UTC 2007-05-30 UTC--2007-07-04 UTC
[7] 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 2003-04-01 UTC--2003-07-04 UTC
[10] 2003-04-01 UTC--2003-07-04 UTC
But when you performed a calculation on a, it did it on .Data, which holds the length of each interval in seconds, counted from its start date (see ?interval).
a@.Data
#[1] 13737600 26697600 3024000 3024000 3024000 3024000 9331200 9331200 8121600 8121600
For the start date of the interval, we need to access the start slot.
a@start
#[1] "2008-10-12 UTC" "2008-08-10 UTC" "2006-01-09 UTC" "2008-03-13 UTC" "2008-09-12 UTC"
#[6] "2007-05-30 UTC" "2003-09-29 UTC" "2003-09-29 UTC" "2003-04-01 UTC" "2003-04-01 UTC"
And the timezone...
a@tzone
#[1] "UTC"
We can also look at what the relationships between the elements are. The last and next to last elements had the same intervals.
a[9] == a[10]
#[1] TRUE
And they're identical objects.
identical(a[9], a[10])
#[1] TRUE
But what is it really checking when you check to see if the elements are equal? Elements 3 and 4 have the same time difference, but are not the same intervals. Therefore, when you checked to see if their lags/leads were equal, it returned TRUE. But since they have different interval dates, it shouldn't have. Only when we check whether they're identical do we get what we expected.
a[3] == a[4]
#[1] TRUE
a[3]@.Data == a[4]@.Data
#[1] TRUE
identical(a[3], a[4])
#[1] FALSE
So what happened? What a[3] == a[4] really checks is a[3]@.Data == a[4]@.Data, and therefore it's checking whether 3024000 equals 3024000. It does, so it returns TRUE. But identical checks all the slots and finds that they are not the same, because start differs between the two.
Then I thought about using identical with lead/lag so that we could fit one logical into the code, but look at this.
a[9]
#[1] 2003-04-01 UTC--2003-07-04 UTC
# now lead
lead(a[9])
#2003-04-01 UTC--NA
The output does not look like a[10] as expected.
#now lag
lag(a[9])
#[1] NA
#attr(,"start")
#[1] "2003-04-01 UTC"
#attr(,"tzone")
#[1] "UTC"
#attr(,"class")
#[1] "Interval"
#attr(,"class")attr(,"package")
#[1] "lubridate"
So lead and lag have a different effect on S4 objects. To get a better handle on what your first attempt was outputting, I did this:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = lead(overlap),
cond2 = lag(overlap))
I got a lot of warning messages that said
#In mutate_impl(.data, dots) :
# Vectorizing 'Interval' elements may not preserve their attributes
I don't know enough about R objects to understand how data in S4 class is stored, but it certainly looks different than the typical S3 object.
Seems like using as.character, as you did, is the way to go.
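Another option, if all you need is the comparison, might be to skip the Interval object and compare the two date columns directly, since lead and lag behave normally on Date vectors. This is only a sketch on a shortened copy of your data (ids 1, 2 and 4, renumbered 1 to 3); the name res is my own:

```r
library(dplyr)

# Shortened copy of the example data, already converted to Date
df <- data.frame(
  id = rep(1:3, each = 2),
  time1 = as.Date(c("2008-10-12", "2008-08-10", "2006-01-09",
                    "2008-03-13", "2003-09-29", "2003-09-29")),
  time2 = as.Date(c("2009-03-20", "2009-06-15", "2006-02-13",
                    "2008-04-17", "2004-01-15", "2004-01-15"))
)

# Two rows describe the same interval only if both endpoints match
res <- df %>%
  group_by(id) %>%
  mutate(cond1 = as.numeric(lead(time1) == time1 & lead(time2) == time2),
         cond2 = as.numeric(lag(time1) == time1 & lag(time2) == time2)) %>%
  ungroup()

res
```

Here the second group (35 days in both rows, but different dates) should come out as 0, and only the third group as 1, with NA where there is no next or previous row.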
Upvotes: 7