Tomas Greif
Tomas Greif

Reputation: 22623

Filter data frame based on multiple conditions in another data frame

I have two data frames:

I would like to filter only rows where y is lower or equal to max_y given that x falls in interval <x_from, x_to)

Is there any way to do this in R (without SQL I used below)?

df <- read.table(header = TRUE, text = '
x y
100 0.1
100 0.2
100 0.3
250 0.2
250 0.3
250 0.4
375 0.2
375 0.25
375 0.35
420 0.15
420 0.16
420 0.17
500 0.23
500 0.55')

matchDf <- read.table(header = TRUE, text = '
x_from x_to max_y
0 300 .2
300 500 .3
500 99999 .5
')

library(sqldf)

sqldf('select a.* 
      from 
        df a 
        join matchDf b on (a.x >= b.x_from 
                           and a.x < b.x_to 
                           and a.y <= b.max_y)'
      )

Upvotes: 4

Views: 168

Answers (2)

Cath
Cath

Reputation: 24074

You can do:

df[mapply(function(x, y) {
               y <= matchDf$max_y[x >=matchDf$x_from  & x < matchDf$x_to]
            }, x=df$x, y=df$y), ]
     # x    y
# 1  100 0.10
# 2  100 0.20
# 4  250 0.20
# 7  375 0.20
# 8  375 0.25
# 10 420 0.15
# 11 420 0.16
# 12 420 0.17
# 13 500 0.23

The function mapply permits, for each couple (x,y), to know if y is lower or equal to the appropriate max_y value and is applied to each "couple" of df and return TRUE or FALSE, then df is subsetted according to the mapply result.

Upvotes: 3

Thomas
Thomas

Reputation: 44527

Try this:

df[df$y <= matchDf$max_y[cut(df$x, c(0,matchDf$x_to))],]
     x    y
1  100 0.10
2  100 0.20
4  250 0.20
7  375 0.20
8  375 0.25
10 420 0.15
11 420 0.16
12 420 0.17
13 500 0.23

What's happening here is that cut is giving you which row in matchDf each observation in df should be in. Then, you just use that as a positional row extraction vector, and state the conditional relationship for y using <=.

To see how cut works, just pull it out of the expression:

> cut(df$x, c(0,matchDf$x_to))
 [1] (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (0,300]   (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500] (300,500]
Levels: (0,300] (300,500] (500,1e+05]

The level labels are irrelevant, because [ uses the underlying integer values to extract.

Upvotes: 4

Related Questions