Reputation: 181
What I'd like to achieve is to be able to compare the data based on a date and if the date is between the range then take the lowest "PDF2" value.
This is an example of the two dataframe I'm working with. I want to check if the data of the column "R" from "df" is found in the column "R" of "df2" check if the date is between the range of df2 and if there is any conflict or duplicate I want to always keep the lowest value of "PDF2".
df <- data.frame("D" = c("01/01/2019", "01/02/2019", "01/03/2019", "01/12/2019"),
"R" = c("ABC123", "ABC123", "ABC123", "ABC1"),
"PDF" = c(1.23, 1.23, 1.23, 1.23),
stringsAsFactors = FALSE)
df2 <- data.frame("DD" = c("01/01/2019", "01/02/2019", "01/01/2019"),
"DF" = c("01/02/2019", "01/03/2019", "01/11/2019"),
"R" = c("ABC123", "ABC123", "ABC1"),
"PDF2" = c(1.12, 1.11, 1.12),
stringsAsFactors = FALSE)
This is the result I'm expecting.
result <- data.frame("R" = c("ABC123", "ABC123", "ABC123"),
"D" = c("01/01/2019", "01/02/2019", "01/03/2019"),
"DD" = c("01/01/2019", "01/02/2019", "01/02/2019"),
"DF" = c("01/02/2019", "01/03/2019", "01/03/2019"),
"PDF" = c(1.23, 1.23, 1.23),
"PDF2" = c(1.12, 1.11, 1.11),
stringsAsFactors = FALSE)
As you can see the "ABC1" isn't in the result because the date isn't in the range.
My current problem is to only keep the lowest value if there are duplicate or conflict for the date range.
Here is a sample of my current code :
temp <- merge(df, df2, by = "R")
myd <- which(as.Date(temp$D, format = "%d/%m/%Y") <= as.Date(temp$DF, format = "%d/%m/%Y"))
myd2 <- which(as.Date(temp$D, format = "%d/%m/%Y") >= as.Date(temp$DD, format = "%d/%m/%Y"))
myd <- myd[myd %in% myd2]
if (length(myd)) {
temp <- temp[myd,]
}
And also how can I get the lines that didn't match the requirements in a separate dataframe ?
Upvotes: 0
Views: 55
Reputation: 1345
You can use the data.table
package if you want something very efficient. The following code would do what you ask for
library(data.table)
setDT(df, key="R")
setDT(df2, key="R")
df[, D:=as.Date(D, format = "%d/%m/%Y")]
df2[, `:=`(
DD = as.Date(DD, format = "%d/%m/%Y"),
DF = as.Date(DF, format = "%d/%m/%Y")
)]
df[df2][D>=DD & D<=DF][, .(DD=max(DD), DF=max(DF), PDF2=PDF2[which.max(DD)]), .(D, R, PDF)]
## D R PDF DD DF PDF2
## 1: 2019-01-01 ABC123 1.23 2019-01-01 2019-02-01 1.12
## 2: 2019-02-01 ABC123 1.23 2019-02-01 2019-03-01 1.11
## 3: 2019-03-01 ABC123 1.23 2019-02-01 2019-03-01 1.11
Upvotes: 0
Reputation: 345
I think the answers to this question might help you:
How to find matches for a row in a dataframe conditional on many rows from another dataframe
df %>%
left_join(df2, by = "R") %>%
filter(lubridate::dmy(D) >= lubridate::dmy(DD) & lubridate::dmy(D) <= lubridate::dmy(DF)) %>%
group_by(R,D) %>%
filter(PDF2 == min(PDF2)) %>%
ungroup()
Upvotes: 1