Pnd
Pnd

Reputation: 3

R compare values from one dataframe from values from other dataframe

I am quite new on R so still learning many things. I have been searching around but I cannot find the appropriate answer to my problem. I have these two datasets:

d1
    Criteria Order Low High
1        a     1   0   10
2        a     1  11   20
3        a     1  21   30
4        b     1   0   13
5        b     1  14   32
6        a     2   5   22
7        a     2   0    4
8        b     2   0   18

and then d2

 Criteria Order Final
1        a     1    13
2        b     2    12
3        a     1     8
4        a     2     2 

I was wondering if there is any way to add an extra column to d1when d2$Final is within d1$Low and d1$High and both Criteria and Order match. What I am expecting to obtain would be something like:

 Criteria Order Low High Final
1        a     1   0   10     8
2        a     1  11   20    13
3        a     1  21   30    NA
4        b     1   0   13    NA
5        b     1  14   32    NA
6        a     2   5   22    NA
7        a     2   0    4     2
8        b     2   0   18    12  

Or even a numeric output 1/0 for true or false in the Final column would be ok.

Thanks in advance

Upvotes: 0

Views: 51

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269371

This uses SQL to create a complex join. The need for [...] around Order is to distinguish it from the keyword of the same name.

library(sqldf)

sqldf("select d1.*, d2.Final
  from d1 
  left join d2 on d1.Criteria = d2.Criteria and
                  d1.[Order] = d2.[Order] and
                  d2.Final between d1.Low and d1.High")

giving the same output shown in the question:

  Criteria Order Low High Final
1        a     1   0   10     8
2        a     1  11   20    13
3        a     1  21   30    NA
4        b     1   0   13    NA
5        b     1  14   32    NA
6        a     2   5   22    NA
7        a     2   0    4     2
8        b     2   0   18    12

Note

Data in reproducible form:

Lines1 <- "
    Criteria Order Low High
1        a     1   0   10
2        a     1  11   20
3        a     1  21   30
4        b     1   0   13
5        b     1  14   32
6        a     2   5   22
7        a     2   0    4
8        b     2   0   18"

Lines2 <- "
  Criteria Order Final
1        a     1    13
2        b     2    12
3        a     1     8
4        a     2     2"

d1 <- read.table(text = Lines1)
d2 <- read.table(text = Lines2)

Upvotes: 2

r2evans
r2evans

Reputation: 160407

If your data is "large-ish", this solution will not be for you: the cartesian join will explode beyond what a "standard" computer will tolerate memory-wise.

If, however, your data is small enough (very relative term), then you can do a cartesian-join (also known as a full or full outer join) and filter the results. (This solution is an implementation of one section from https://www.mango-solutions.com/blog/in-between-a-rock-and-a-conditional-join. There are other sections that discuss SQL and fuzzyjoin, both worthy candidates.)

Three dialects, depending on your preference.

Base R

a <- merge(d2, d1, all.x=T)
a <- transform(a, Final = ifelse(Low <= Final & Final <= High, Final, NA))
a[!duplicated(a),]
#   Criteria Order Final Low High
# 1        a     1    NA   0   10
# 2        a     1    13  11   20
# 3        a     1    NA  21   30
# 4        a     1     8   0   10
# 5        a     1    NA  11   20
# 7        a     2    NA   5   22
# 8        a     2     2   0    4
# 9        b     2    12   0   18

It has an extra row, trying to work that elegantly ...

dplyr

library(dplyr)
full_join(d1, d2) %>%
  mutate(Final = if_else(between(Final, Low, High), Final, NA_integer_)) %>%
  group_by(Criteria, Order, Low, High) %>%
  summarise(Final = coalesce(Final)[1]) %>%
  ungroup()
# Joining, by = c("Criteria", "Order")
# # A tibble: 8 x 5
#   Criteria Order   Low  High Final
#   <chr>    <int> <int> <int> <int>
# 1 a            1     0    10    NA
# 2 a            1    11    20    13
# 3 a            1    21    30    NA
# 4 a            2     0     4     2
# 5 a            2     5    22    NA
# 6 b            1     0    13    NA
# 7 b            1    14    32    NA
# 8 b            2     0    18    12

data.table

library(data.table)
as.data.table(d2)[d1, on = .(Final > Low, Final < High, Criteria, Order),
                  .(Criteria, Order, Low, High, x.Final)]
#    Criteria Order Low High x.Final
# 1:        a     1   0   10       8
# 2:        a     1  11   20      13
# 3:        a     1  21   30      NA
# 4:        b     1   0   13      NA
# 5:        b     1  14   32      NA
# 6:        a     2   5   22      NA
# 7:        a     2   0    4       2
# 8:        b     2   0   18      12

(There is also a solution using data.table::foverlaps that might be faster or more memory-frugal. Read the link, it's quite helpful.)


Data:

d1 <- structure(list(Criteria = c("a", "a", "a", "b", "b", "a", "a", 
"b"), Order = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), Low = c(0L, 
11L, 21L, 0L, 14L, 5L, 0L, 0L), High = c(10L, 20L, 30L, 13L, 
32L, 22L, 4L, 18L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))
d2 <- structure(list(Criteria = c("a", "b", "a", "a"), Order = c(1L, 
2L, 1L, 2L), Final = c(13L, 12L, 8L, 2L)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))

Upvotes: 1

Related Questions