M. Beausoleil
M. Beausoleil

Reputation: 3555

Filter a grouped variable from a dataset based on the range values of another dataset using dplyr

I want to take the values of a (large) data frame:

library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)

  x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]

Which eventually reaches a 2, a 3, etc.

And I have a second data frame which contains some variables (x) that I want just part of a range (min max) that is different for each "x" variables

sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))

sub.data
  x min max
1 a   2   6
2 c  50  53
3 d  25  30

The output should look like something like this:

     x       y
1    a       2
2    a       3
3    a       4
4    a       5
5    a       6
6    c      50
7    c      51
8    c      52
9    c      53
10   d      25
11   d      26
12   d      27
13   d      28
14   d      29
15   d      30

I've tried this:

df.grid %>% 
  group_by(x) %>% 
  filter_if(y > sub.data$min)

But it doesn't work as the min column has multiple values and the 'if' part complains.

I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.

I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.

Upvotes: 1

Views: 33

Answers (1)

akrun
akrun

Reputation: 887118

We could use a non-equi join

library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]

-output

    x  y
 1: a  2
 2: a  3
 3: a  4
 4: a  5
 5: a  6
 6: c 50
 7: c 51
 8: c 52
 9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30

With dplyr version 1.1.0, we could also use non-equi joins with join_by

library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min , y <= max)) %>%
    select(x, y)

-output

   x  y
1  a  2
2  a  3
3  a  4
4  a  5
5  a  6
6  d 25
7  d 26
8  d 27
9  d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53

Or as @Davis Vaughan mentioned, use between with a left_joion

left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min, 
     x$max))) %>%
    select(names(df.grid))

Upvotes: 2

Related Questions