Filter data frame based on multiple conditions in another data frame

Question

I have two data frames:

df where for single x there are multiple values y
matchDf which defines constraints I would like to use to filter df

I would like to filter only rows where y is lower or equal to max_y given that x falls in interval



Is there any way to do this in R (without SQL I used below)?

df <- read.table(header = TRUE, text = '
x y
100 0.1
100 0.2
100 0.3
250 0.2
250 0.3
250 0.4
375 0.2
375 0.25
375 0.35
420 0.15
420 0.16
420 0.17
500 0.23
500 0.55')

matchDf <- read.table(header = TRUE, text = '
x_from x_to max_y
0 300 .2
300 500 .3
500 99999 .5
')

library(sqldf)

sqldf('select a.* 
      from 
        df a 
        join matchDf b on (a.x >= b.x_from 
                           and a.x < b.x_to 
                           and a.y <= b.max_y)'
      )

Thomas · Accepted Answer

Try this:

df[df$y <= matchDf$max_y[cut(df$x, c(0,matchDf$x_to))],]
     x    y
1  100 0.10
2  100 0.20
4  250 0.20
7  375 0.20
8  375 0.25
10 420 0.15
11 420 0.16
12 420 0.17
13 500 0.23

What's happening here is that cut is giving you which row in matchDf each observation in df should be in. Then, you just use that as a positional row extraction vector, and state the conditional relationship for y using <=.

To see how cut works, just pull it out of the expression:

> cut(df$x, c(0,matchDf$x_to))
 [1] (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500]
Levels: (0,300] (300,500] (500,1e+05]

The level labels are irrelevant, because [ uses the underlying integer values to extract.

Filter data frame based on multiple conditions in another data frame

Answers (2)

Related Questions