Frederick
Frederick

Reputation: 850

Leftjoin based on range in RHS data.frame dplyr

Please consider the following:

I have two data.frames each containing (patient) ID's, and per ID the date of assessment. Not all ID's have the same amount of assessments.

db.x contains (a small selection of) the ID's an assessment value, and the relative day of the assessment.

db.y contains ID's, a response value and its relative assessment time.

Problem For each assessment in db.x I need to find the corresponding response within the respective time frame (min to max) in db.y. But since the date of assessment in both data.frames do not match (couple of days difference between the assessments) I find this challenging.

Data in both data.frames need to be grouped by ID.

I would love to have dplyr solution but any other would work as well. Please find below my approach, which is obviously not working.

Approach and data

library(tidyverse)

# Example data
db.x <- data.frame(id = c(rep(18, 8), rep(19, 3)),
                   value = c(60, 75, 100, 100, 85, 80, 80, 90,
                             90, 80, 100),
                   time = c(-8, 85, 203, 259, 441, 623, 791, 938,
                            -7, 85, 169))

# View data
db.x
#>    id value time
#> 1  18    60   -8
#> 2  18    75   85
#> 3  18   100  203
#> 4  18   100  259
#> 5  18    85  441
#> 6  18    80  623
#> 7  18    80  791
#> 8  18    90  938
#> 9  19    90   -7
#> 10 19    80   85
#> 11 19   100  169

db.y <- data.frame(id = c(rep(18, 5), rep(19, 4)),
                   response = c("a", "a", "a", "b", "c",
                                "b", "b", "b", "b"),
                   time = c(78, 196, 251, 342, 454,
                            79, 189, 281, 303))

# View data
db.y
#>   id response time
#> 1 18        a   78
#> 2 18        a  196
#> 3 18        a  251
#> 4 18        b  342
#> 5 18        c  454
#> 6 19        b   79
#> 7 19        b  189
#> 8 19        b  281
#> 9 19        b  303

# Extract the min and max time of the response
db.y <- db.y %>%
        group_by(id, response) %>%
        mutate(min = min(time), max = max(time)) %>%
        distinct(id, response, min, max) %>% 
        ungroup

db.y
#> # A tibble: 4 x 4
#>      id response   min   max
#>   <dbl> <fct>    <dbl> <dbl>
#> 1    18 a           78   251
#> 2    18 b          342   342
#> 3    18 c          454   454
#> 4    19 b           79   303

# PROBLEM: How can I match the responses in db.x to the min/max times in db.y?
db.x %>%
        group_by(id) %>%
        mutate(response = ifelse(time %in% db.y %>% group_by(id = id) %>% select(min, max),
                                 response, NA))
#> Error in mutate_impl(.data, dots): Evaluation error: no applicable method for 'group_by_' applied to an object of class "logical".

# Desired output
db.x %>% 
        mutate(response = c(NA, "a", "a", NA, NA, NA, NA, NA, NA, "b", "b"))
#>    id value time response
#> 1  18    60   -8     <NA>
#> 2  18    75   85        a
#> 3  18   100  203        a
#> 4  18   100  259     <NA>
#> 5  18    85  441     <NA>
#> 6  18    80  623     <NA>
#> 7  18    80  791     <NA>
#> 8  18    90  938     <NA>
#> 9  19    90   -7     <NA>
#> 10 19    80   85        b
#> 11 19   100  169        b

Created on 2018-11-12 by the reprex package (v0.2.1)

Thank you very much!

Upvotes: 1

Views: 48

Answers (1)

arg0naut91
arg0naut91

Reputation: 14764

You could go for a full_join if you would like to stay within tidyverse framework (which otherwise does not support non-equi joins):

library(dplyr)

db.x %>%
  full_join(db.y) %>%
  mutate(
    response = if_else(time >= min & time <= max, as.character(response), NA_character_)
  ) %>% distinct(id, value, time, .keep_all = TRUE) %>%
  select(-min, -max)

Output:

   id value time response
1  18    60   -8     <NA>
2  18    75   85        a
3  18   100  203        a
4  18   100  259     <NA>
5  18    85  441     <NA>
6  18    80  623     <NA>
7  18    80  791     <NA>
8  18    90  938     <NA>
9  19    90   -7     <NA>
10 19    80   85        b
11 19   100  169        b

However, this is much more straightforward and scalable in data.table:

library(data.table)

setDT(db.y)[setDT(db.x), on = .(id = id, min <= time, max >= time), .(id, value, time, response)]

Output:

    id value time response
 1: 18    60   -8     <NA>
 2: 18    75   85        a
 3: 18   100  203        a
 4: 18   100  259     <NA>
 5: 18    85  441     <NA>
 6: 18    80  623     <NA>
 7: 18    80  791     <NA>
 8: 18    90  938     <NA>
 9: 19    90   -7     <NA>
10: 19    80   85        b
11: 19   100  169        b

Comparison in terms of speed:

Unit: milliseconds
       expr      min       lq     mean   median       uq       max neval
 tidyverser 5.703497 6.369896 7.400882 7.033012 8.043276 12.162548   100
         dt 1.812313 2.088171 2.506833 2.485092 2.958956  3.384321   100

Upvotes: 1

Related Questions