pasipasi
pasipasi

Reputation: 1226

Joining two data frames with intervals misbehaves?

Edit (2019-06): This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.

I'm trying to find overlapping intervals and decided to join the interval data on itself with dplyr::left_join() so that I could compare intervals with lubridate::int_overlaps() to every other interval by the same id.

Here's how I expect left_join() to behave. The two tibbles with three rows cross to form a with 9 rows:

library(tidyverse)

tibble(a = rep("a", 3), b = rep(1, 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(2, 3)))
Joining, by = "a"
# A tibble: 9 x 3
      a     b     c
  <chr> <dbl> <dbl>
1     a     1     2
2     a     1     2
3     a     1     2
4     a     1     2
5     a     1     2
6     a     1     2
7     a     1     2
8     a     1     2
9     a     1     2

And here's how the same code behaves with intervals. I get nine rows but the rows don't cross like they do above:

tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
Joining, by = "a"
# A tibble: 9 x 3
      a                              b                              c
  <chr>                 <S4: Interval>                 <S4: Interval>
1     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3     a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4     a                         NA--NA                         NA--NA
5     a                         NA--NA                         NA--NA
6     a                         NA--NA                         NA--NA
7     a                         NA--NA                         NA--NA
8     a                         NA--NA                         NA--NA
9     a                         NA--NA                         NA--NA

I think this is unexpected, but I might be missing something? Or is it a bug?

I'm using 1.7.1, 1.3.4 and 0.7.4.

Upvotes: 9

Views: 339

Answers (3)

pasipasi
pasipasi

Reputation: 1226

This problem does not exist anymore, as this issue has been closed and a related feature implemented. If you now run the code with updated packages, it will work.

library(lubridate)
library(tidyverse)

tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))))
#> Joining, by = "a"
#> # A tibble: 9 x 3
#>   a     b                              c                             
#>   <chr> <Interval>                     <Interval>                    
#> 1 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 2 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 3 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 4 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 5 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 6 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 7 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 8 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
#> 9 a     2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

Created on 2019-06-07 by the reprex package (v0.3.0)

Upvotes: 1

moodymudskipper
moodymudskipper

Reputation: 47300

The bug

The object still contains the relevant information:

res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  left_join(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))) 

print.data.frame(res)
# a                              b                              c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

res$c    
# [1] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [5] 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# [9] 2002-01-01 UTC--2003-01-01 UTC

But when subsetting by indices it doesn't work anywmore :

res_df <- as.data.frame(res)

head(res_df)
  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a                         NA--NA                         NA--NA
5 a                         NA--NA                         NA--NA
6 a                         NA--NA                         NA--NA

res_df[4,"c"]
[1] NA--NA

and tibble:::print.tbl makes use of head. That's why the issue is immediately visible with tibbles and not with data.frames.

Typing str(res$b) we see that we only have 3 start values for 9 data values.

if we do:

res_df$b@start <- rep(res_df$b@start,3)
res_df$c@start <- rep(res_df$c@start,3)

eveything now print fine:

  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

The Solution

We've seen that as.data.frame is not enough, left_join is the function messing things up, use merge instead:

res <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3)) %>% 
  merge(tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003))),
        all.x=TRUE) 

head(res)
# a                              b                              c
# 1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
# 6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

res[4,"c"]
#[1] 2002-01-01 UTC--2003-01-01 UTC

I've reported the issue here

Upvotes: 7

renato vitolo
renato vitolo

Reputation: 1754

Looks like a bug in tibble():

> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
Error in round_x - lhs :
  Arithmetic operators undefined for 'Interval' and 'Interval' classes:
  convert one to numeric or a matching time-span class.

However:

> AA <- as.data.frame(AA)
class(AA$b)
> class(AA$b)
[1] "Interval"
attr(,"package")
[1] "lubridate"
> AA
  a                              b
1 a 2001-01-01 UTC--2002-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC

Therefore, this works:

> AA <- tibble(a = rep("a", 3), b = rep(make_date(2001) %--% make_date(2002), 3))
> BB <- tibble(a = rep("a", 3), c = rep(make_date(2002) %--% make_date(2003)))
> AA %>% as.data.frame %>% left_join(BB)
Joining, by = "a"
  a                              b                              c
1 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
2 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
3 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
4 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
5 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
6 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
7 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
8 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC
9 a 2001-01-01 UTC--2002-01-01 UTC 2002-01-01 UTC--2003-01-01 UTC

although this does not:

> AA %>% left_join(BB)
Joining, by = "a"
Error in round_x - lhs :
  Arithmetic operators undefined for 'Interval' and 'Interval' classes:
  convert one to numeric or a matching time-span class.

Note: I'm using tibble_1.4.1 (same version of lubridate and dplyr as you), on R 3.4.3 for x86_64-pc-linux-gnu

Upvotes: 4

Related Questions