Ashirwad
Ashirwad

Reputation: 2040

Filtering records in one data frame based on another by matching multiple criteria

I have the following two data frames, dat1 and dat2:

library(tidyverse)
dat1 <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 1,
  "A", 1, 2,
  "A", 1, 3,
  "A", 1, 4,
  "A", 1, 5,
  "A", 2, 1,
  "A", 2, 2,
  "A", 2, 3,
  "A", 2, 4,
  "A", 2, 5,
  "B", 1, 1,
  "B", 1, 2,
  "B", 1, 3,
  "B", 1, 4,
  "B", 1, 5,
  "B", 2, 1,
  "B", 2, 2,
  "B", 2, 3,
  "B", 2, 4,
  "B", 2, 5,
)

dat2 <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 3,
  "B", 2, 4
)

And I am trying to filter the records in dat1 based on the following conditions:

For this example, say the range is one unit apart. So, my result data frame will look like the following:

result <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 2,
  "A", 1, 3,
  "A", 1, 4,
  "B", 2, 3,
  "B", 2, 4,
  "B", 2, 5
)

I am aware of dplyr::semi_join() but it doesn't let me filter based on a range. Any ideas how I can solve this problem? Tidyverse based solutions will be great!

Upvotes: 2

Views: 85

Answers (3)

IceCreamToucan
IceCreamToucan

Reputation: 28685

Edited to use native sqldf string substitution as mentioned by GG's comment instead of sprintf.

library(sqldf)

check_range <- 1

fn$sqldf('
select  one.*
from    dat1 one
        join dat2 two
          on  one.subj = two.subj
              and one.drive = two.drive
              and one.measure - two.measure between -`check_range` and `check_range`
')
#   subj drive measure
# 1    A     1       2
# 2    A     1       3
# 3    A     1       4
# 4    B     2       3
# 5    B     2       4
# 6    B     2       5

Upvotes: 3

Uwe
Uwe

Reputation: 42544

For the sake of completeness, here is also a solution which uses a non-equi join:

library(data.table)
range <- 1
idx <- setDT(dat1)[
  setDT(dat2)[, .(subj, drive, lower = measure - range, upper = measure + range)], 
  on = .(subj, drive, measure >= lower, measure <= upper), which = TRUE]
dat1[idx]
   subj drive measure
1:    A     1       2
2:    A     1       3
3:    A     1       4
4:    B     2       3
5:    B     2       4
6:    B     2       5

Upvotes: 1

akrun
akrun

Reputation: 887118

An option is to first do an inner_join and then use between

library(dplyr)
inner_join(dat1, dat2, by = c('subj', 'drive')) %>% 
    group_by(subj, drive) %>% 
    filter(between(measure.x, first(measure.y)-1, first(measure.y) + 1)) %>% 
    select(measure = measure.x)
# A tibble: 6 x 3
# Groups:   subj, drive [2]
#  subj  drive measure
#  <chr> <dbl>   <dbl>
#1 A         1       2
#2 A         1       3
#3 A         1       4
#4 B         2       3
#5 B         2       4
#6 B         2       5

or with data.table

library(data.table)
setDT(dat1)[setDT(dat2), .SD[between(measure, i.measure -1,
          i.measure + 1)], on = .(subj, drive), by = .EACHI]
#    subj drive measure
#1:    A     1       2
#2:    A     1       3
#3:    A     1       4
#4:    B     2       3
#5:    B     2       4
#6:    B     2       5

Upvotes: 3

Related Questions