Reputation: 850
Please consider the following:
I have two data.frames
each containing (patient) ID's, and per ID the date of assessment. Not all ID's have the same amount of assessments.
db.x
contains (a small selection of) the ID's an assessment value, and the relative day of the assessment.
db.y
contains ID's, a response value and its relative assessment time.
Problem
For each assessment in db.x
I need to find the corresponding response within the respective time frame (min
to max
) in db.y
. But since the date of assessment in both data.frames
do not match (couple of days difference between the assessments) I find this challenging.
Data in both data.frames
need to be grouped by ID
.
I would love to have dplyr
solution but any other would work as well. Please find below my approach, which is obviously not working.
Approach and data
library(tidyverse)
# Example data
db.x <- data.frame(id = c(rep(18, 8), rep(19, 3)),
value = c(60, 75, 100, 100, 85, 80, 80, 90,
90, 80, 100),
time = c(-8, 85, 203, 259, 441, 623, 791, 938,
-7, 85, 169))
# View data
db.x
#> id value time
#> 1 18 60 -8
#> 2 18 75 85
#> 3 18 100 203
#> 4 18 100 259
#> 5 18 85 441
#> 6 18 80 623
#> 7 18 80 791
#> 8 18 90 938
#> 9 19 90 -7
#> 10 19 80 85
#> 11 19 100 169
db.y <- data.frame(id = c(rep(18, 5), rep(19, 4)),
response = c("a", "a", "a", "b", "c",
"b", "b", "b", "b"),
time = c(78, 196, 251, 342, 454,
79, 189, 281, 303))
# View data
db.y
#> id response time
#> 1 18 a 78
#> 2 18 a 196
#> 3 18 a 251
#> 4 18 b 342
#> 5 18 c 454
#> 6 19 b 79
#> 7 19 b 189
#> 8 19 b 281
#> 9 19 b 303
# Extract the min and max time of the response
db.y <- db.y %>%
group_by(id, response) %>%
mutate(min = min(time), max = max(time)) %>%
distinct(id, response, min, max) %>%
ungroup
db.y
#> # A tibble: 4 x 4
#> id response min max
#> <dbl> <fct> <dbl> <dbl>
#> 1 18 a 78 251
#> 2 18 b 342 342
#> 3 18 c 454 454
#> 4 19 b 79 303
# PROBLEM: How can I match the responses in db.x to the min/max times in db.y?
db.x %>%
group_by(id) %>%
mutate(response = ifelse(time %in% db.y %>% group_by(id = id) %>% select(min, max),
response, NA))
#> Error in mutate_impl(.data, dots): Evaluation error: no applicable method for 'group_by_' applied to an object of class "logical".
# Desired output
db.x %>%
mutate(response = c(NA, "a", "a", NA, NA, NA, NA, NA, NA, "b", "b"))
#> id value time response
#> 1 18 60 -8 <NA>
#> 2 18 75 85 a
#> 3 18 100 203 a
#> 4 18 100 259 <NA>
#> 5 18 85 441 <NA>
#> 6 18 80 623 <NA>
#> 7 18 80 791 <NA>
#> 8 18 90 938 <NA>
#> 9 19 90 -7 <NA>
#> 10 19 80 85 b
#> 11 19 100 169 b
Created on 2018-11-12 by the reprex package (v0.2.1)
Thank you very much!
Upvotes: 1
Views: 48
Reputation: 14764
You could go for a full_join
if you would like to stay within tidyverse
framework (which otherwise does not support non-equi
joins):
library(dplyr)
db.x %>%
full_join(db.y) %>%
mutate(
response = if_else(time >= min & time <= max, as.character(response), NA_character_)
) %>% distinct(id, value, time, .keep_all = TRUE) %>%
select(-min, -max)
Output:
id value time response
1 18 60 -8 <NA>
2 18 75 85 a
3 18 100 203 a
4 18 100 259 <NA>
5 18 85 441 <NA>
6 18 80 623 <NA>
7 18 80 791 <NA>
8 18 90 938 <NA>
9 19 90 -7 <NA>
10 19 80 85 b
11 19 100 169 b
However, this is much more straightforward and scalable in data.table
:
library(data.table)
setDT(db.y)[setDT(db.x), on = .(id = id, min <= time, max >= time), .(id, value, time, response)]
Output:
id value time response
1: 18 60 -8 <NA>
2: 18 75 85 a
3: 18 100 203 a
4: 18 100 259 <NA>
5: 18 85 441 <NA>
6: 18 80 623 <NA>
7: 18 80 791 <NA>
8: 18 90 938 <NA>
9: 19 90 -7 <NA>
10: 19 80 85 b
11: 19 100 169 b
Comparison in terms of speed:
Unit: milliseconds
expr min lq mean median uq max neval
tidyverser 5.703497 6.369896 7.400882 7.033012 8.043276 12.162548 100
dt 1.812313 2.088171 2.506833 2.485092 2.958956 3.384321 100
Upvotes: 1