seansteele
seansteele

Reputation: 749

Join overlapping ranges from two data frames in r

Note: This question was closed as a 'duplicate'. The solutions offered here and here did not answer my question. They showed how to merge when a single entry fell within a range, I'm trying to identify overlapping ranges and joining them. Perhaps my title could have been better...

I have a main data set main_df with a start and end time (in seconds). I would like to see if the time range in main_df falls within a list of ranges in lookup_df, and if so, grab the value from lookup_df. Additionally, if the main_df falls within two different lookup ranges, duplicate the row so each value is represented.***

main_df <- tibble(start = c(30,124,161),
                end = c(80,152,185))

lookup_df <- tibble(start = c(34,73,126,141,174,221),
                       end = c(69,123,136,157,189,267),
                       value = c('a','b','b','b','b','a'))

# Do something here to get the following:

> final_df
# A tibble: 4 x 4
  start   end value notes                                      
  <dbl> <dbl> <chr> <chr>                                      
1    30    80 a     ""                                         
2    30    80 b     "Duplicate because it falls within a and b"
3   124   152 b     "Falls within two lookups but both are b"  
4   161   185 b     ""      

***Edit: Looking at the way I've structured the problem...

#Not actual code
left_join(main_df, lookup_df, by(some_range_join_function) %>% 
  add_rows(through_some_means)

Rather than having to add a new row I could flip how I'm joining them...

semi_join(lookup_df, main_df, by(some_range_join_function))

Upvotes: 2

Views: 1353

Answers (5)

shs
shs

Reputation: 3899

You can use the fuzzyjoin package to join based on intervals with the fuzzyjoin::interval_*_join() functions.

I'll be using an inner join, because if you use a semi join like you propose, you will loose the value col and get just 3 rows.

library(tidyverse)
library(fuzzyjoin)

fuzzyjoin::interval_inner_join(lookup_df, main_df, by = c("start", "end"), type = "any")
#> # A tibble: 5 × 5
#>   start.x end.x value start.y end.y
#>     <dbl> <dbl> <chr>   <dbl> <dbl>
#> 1      34    69 a          30    80
#> 2      73   123 b          30    80
#> 3     126   136 b         124   152
#> 4     141   157 b         124   152
#> 5     174   189 b         161   185

As you can see, the fuzzy_inner_join() preserves the by cols from both tables, since they are not the same in a fuzzy join. Also, we still have separate rows for those cases in main_df that match multiple cases in lookup_df. Thus, we do some cleanup of the joined table:

interval_inner_join(lookup_df, main_df, 
                    by = c("start", "end"), 
                    type = "any") |> 
  select(-ends_with(".x")) |> # remove lookup interval cols
  distinct() |> # remove duplicate
  rename_with(str_remove, ends_with(".y"), "\\.y") # remove suffixes from col names
#> # A tibble: 4 × 3
#>   value start   end
#>   <chr> <dbl> <dbl>
#> 1 a        30    80
#> 2 b        30    80
#> 3 b       124   152
#> 4 b       161   185

Finally, a clarification of terminology: In your question you state you want to join based on the interval from main_df falling within the interval from lookup_df. This is possible by using type = "within" in interval_*_join(). But based on the examples you provide, it appears you want to join based on any overlap. This can be done with type = "any", but it is the default, so you don't need to specify it.

Upvotes: 0

PaulS
PaulS

Reputation: 25528

A possible solution, based on powerjoin:

library(tidyverse)
library(powerjoin)

power_left_join(
  main_df, lookup_df,
  by = ~ (.x$start <= .y$start & .x$end >= .y$end) |
    (.x$start >= .y$start & .x$start <= .y$end) | 
    (.x$start <= .y$start & .x$end >= .y$start), 
  keep = "left") %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b

Or using tidyr::crossing:

library(tidyverse)

crossing(main_df, lookup_df,
        .name_repair = ~ c("start", "end", "start2", "end2", "value")) %>% 
  filter((start <= start2 & end >= end2) |
         (start >= start2 & start <= end2) | (start <= start2 & end >= start2)) %>% 
  select(-start2, -end2) %>% 
  distinct()

#> # A tibble: 4 x 3
#>   start   end value
#>   <dbl> <dbl> <chr>
#> 1    30    80 a    
#> 2    30    80 b    
#> 3   124   152 b    
#> 4   161   185 b

Upvotes: 0

Merijn van Tilborg
Merijn van Tilborg

Reputation: 5897

You can use foverlaps from data.table for this.

library(data.table)

setDT(main_df) # make it a data.table if needed
setDT(lookup_df) # make it a data.table if needed

setkey(main_df, start, end) # set the keys of 'y'

foverlaps(lookup_df, main_df, nomatch = NULL) # do the lookup

#    start end i.start i.end value
# 1:    30  80      34    69     a
# 2:    30  80      73   123     b
# 3:   124 152     126   136     b
# 4:   124 152     141   157     b
# 5:   161 185     174   189     b

Or to get the cleaned results as end result (OP's final_df)

unique(foverlaps(lookup_df, main_df, nomatch = NULL)[, .(start, end, value)])

   start end value
1:    30  80     a
2:    30  80     b
3:   124 152     b
4:   161 185     b

Upvotes: 1

Ma&#235;l
Ma&#235;l

Reputation: 52349

Another option is fuzzyjoin::interval_join:

library(fuzzyjoin)
library(dplyr)

interval_join(main_df, lookup_df, by = c("start", "end"), mode = "inner") %>% 
  group_by(value, start.x, end.x) %>% 
  slice(1) %>% 
  select(start = start.x, end = end.x, value)

# A tibble: 4 × 3
# Groups:   value, start, end [4]
  start   end value
  <dbl> <dbl> <chr>
1    30    80 a    
2    30    80 b    
3   124   152 b    
4   161   185 b    

Upvotes: 1

jay.sf
jay.sf

Reputation: 73692

You could do some logical comparisons and then a case handling what shall happen if all are 'b', 'a' and 'b', etc. In this way you easily could add more cases, e.g. both are 'a', one is 'a', more are 'b' which you didn't declare in OP. The approach yields NULL if there are no matches which gets omitted during rbind.

f <- \(x, y) {
  w <- which((x[1] >= y[, 1] & x[1] <= y[, 2]) | (x[2] >= y[, 1] & x[1] <= y[, 2]))
  if (length(w) > 0) {
    d <- data.frame(t(x), value=cbind(y[w, 3]), notes='')
    if (length(w) >= 2) {
      if (all(d$value == 'b')) {
        d <- d[!duplicated(d$value), ]
        d$notes[1] <- 'both b'
      }
      else {
        d$notes[nrow(d)] <- 'a & b'
      }
    }
    d
  }
}

apply(main_df, 1, f, lookup_df, simplify=F) |> do.call(what=rbind)
#   start end value  notes
# 1    30  80     a       
# 2    30  80     b  a & b
# 3   124 152     b both b
# 4   161 185     b     

Data:

main_df <- structure(list(start = c(2, 30, 124, 161), end = c(1, 80, 152, 
185)), row.names = c(NA, -4L), class = "data.frame")

lookup_df <- structure(list(start = c(34, 73, 126, 141, 174, 221), end = c(69, 
123, 136, 157, 189, 267), value = c("a", "b", "b", "b", "b", 
"a")), row.names = c(NA, -6L), class = "data.frame")

Upvotes: 1

Related Questions