Arkning
Arkning

Reputation: 181

Is it possible to merge based on many conditions?

What I'd like to achieve is to be able to compare the data based on a date and if the date is between the range then take the lowest "PDF2" value.

This is an example of the two dataframe I'm working with. I want to check if the data of the column "R" from "df" is found in the column "R" of "df2" check if the date is between the range of df2 and if there is any conflict or duplicate I want to always keep the lowest value of "PDF2".

df <- data.frame("D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/12/2019"),
             "R" = c("ABC123", "ABC123", "ABC123", "ABC1"),
             "PDF" = c(1.23, 1.23, 1.23, 1.23),
             stringsAsFactors = FALSE)

df2 <- data.frame("DD" = c("01/01/2019", "01/02/2019", "01/01/2019"),
              "DF" = c("01/02/2019", "01/03/2019", "01/11/2019"),
              "R" = c("ABC123", "ABC123", "ABC1"),
              "PDF2" = c(1.12, 1.11, 1.12),
              stringsAsFactors = FALSE)

This is the result I'm expecting.

result <- data.frame("R" = c("ABC123", "ABC123", "ABC123"),
                 "D" = c("01/01/2019", "01/02/2019", "01/03/2019"),
                 "DD" = c("01/01/2019", "01/02/2019", "01/02/2019"),
                 "DF" = c("01/02/2019", "01/03/2019", "01/03/2019"),
                 "PDF" = c(1.23, 1.23, 1.23),
                 "PDF2" = c(1.12, 1.11, 1.11),
                 stringsAsFactors = FALSE)

As you can see the "ABC1" isn't in the result because the date isn't in the range.

My current problem is to only keep the lowest value if there are duplicate or conflict for the date range.

Here is a sample of my current code :

temp <- merge(df, df2, by = "R")
myd <- which(as.Date(temp$D, format = "%d/%m/%Y") <= as.Date(temp$DF, format = "%d/%m/%Y"))
myd2 <- which(as.Date(temp$D, format = "%d/%m/%Y") >= as.Date(temp$DD, format = "%d/%m/%Y"))
myd <- myd[myd %in% myd2]
if (length(myd)) {
  temp <- temp[myd,]
}

And also how can I get the lines that didn't match the requirements in a separate dataframe ?

Upvotes: 0

Views: 55

Answers (2)

J.P. Le Cavalier
J.P. Le Cavalier

Reputation: 1345

You can use the data.table package if you want something very efficient. The following code would do what you ask for

library(data.table)

setDT(df, key="R")
setDT(df2, key="R")

df[, D:=as.Date(D, format = "%d/%m/%Y")]
df2[, `:=`(
  DD = as.Date(DD, format = "%d/%m/%Y"),
  DF = as.Date(DF, format = "%d/%m/%Y")
)]

df[df2][D>=DD & D<=DF][, .(DD=max(DD), DF=max(DF), PDF2=PDF2[which.max(DD)]), .(D, R, PDF)]
##              D      R  PDF         DD         DF PDF2
##  1: 2019-01-01 ABC123 1.23 2019-01-01 2019-02-01 1.12
##  2: 2019-02-01 ABC123 1.23 2019-02-01 2019-03-01 1.11
##  3: 2019-03-01 ABC123 1.23 2019-02-01 2019-03-01 1.11

Upvotes: 0

Jeroen Colin
Jeroen Colin

Reputation: 345

I think the answers to this question might help you:

How to find matches for a row in a dataframe conditional on many rows from another dataframe

df %>% 
  left_join(df2, by = "R") %>% 
  filter(lubridate::dmy(D) >= lubridate::dmy(DD) & lubridate::dmy(D) <= lubridate::dmy(DF)) %>% 
  group_by(R,D) %>% 
  filter(PDF2 == min(PDF2)) %>% 
  ungroup()

Upvotes: 1

Related Questions