Reputation: 55
I am trying to merge 2 data frame using multiple conditions and have used merge command but unable to get a successful output.
#Data Frame df1#
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
df1<- data.frame(ID,Location, startdate, enddate)
#Data Frame df2#
ID<-c("A1", "A1", "A4")
N<- c(2,1,2)
Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
Amt<-c("2200","1000", "500")
df2<- data.frame(ID, N, Loss_Date,Amt)
I want to merge these 2 dataframe by using Location as common column and Loss_Date in df2 lies between (inclusive) Start_Date and End_Date in df2. You can see that second entry in df2 doesnt get mapped as the date is not in range of df1
#Required Output
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
N<-c(2,0,0,2)
Loss_Date <- c("2014-11-15", "NA", "NA", "2015-11-30")
Amt<-c("2200","0","0", "500")
Output<- data.frame(ID,Location, startdate, enddate,N, Loss_Date,Amt)
I created a common ID using the Year and ID but get the wrong mapping. Tried various ways to use merge and match but the command doesn work. I need this to run on over 170K observations. Both data frame are of unequal length. Any help would be really appreciated.
Upvotes: 4
Views: 10562
Reputation: 118779
In the current development version of data.table (v1.9.7), non-equi joins are implemented. With that we can do:
require(data.table) # v1.9.7+
setDT(df2)[df1, .(ID, Location, startdate, enddate, N, x.Loss_Date, Amt),
on=.(ID, Loss_Date>=startdate, Loss_Date<=enddate)]
# ID Location startdate enddate N x.Loss_Date Amt
# 1: A1 012A 2014-11-01 2014-12-31 2 2014-11-15 2200
# 2: A2 234B 2014-01-01 2014-08-31 NA <NA> NA
# 3: A3 012A 2015-10-01 2015-12-31 NA <NA> NA
# 4: A4 238C 2015-01-01 2015-12-31 2 2015-11-30 500
Upvotes: 4
Reputation: 155
sqldf is very robust and easy to read. Check this code out:
library(sqldf)
Output<-sqldf("
SELECT L.*, r.N, r.Loss_Date, r.Amt
FROM df1 as L
LEFT JOIN df2 as r
ON
L.ID=r.ID AND
r.Loss_Date BETWEEN L.startdate AND L.enddate
ORDER BY L.ID")
where "L" represents df1 (that's the df1 as l), and "r" represents df2 (df2 as r).
Upvotes: 1
Reputation: 273
I've done the merge using package dplyr
which is really fast and easy to use.
You should add to your data frame definition this stringsAsFactors=F
df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)
df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)
So your character inputs don't get changed to factors and they don't give you undesired results
install.packages("dplyr")
library(dplyr)
output <- full_join(df1, df2, by="ID") %>%
filter(Loss_Date >= startdate & Loss_Date <= enddate)
Output:
ID Location startdate enddate N Loss_Date Amt
1 A1 012A 2014-11-01 2014-12-31 2 2014-11-15 2200
2 A4 238C 2015-01-01 2015-12-31 2 2015-11-30 500
Again, as appointed by the comments, if you want to preserve the rows that doesn't match the condition, you should use another function:
output2 <- left_join(df1, df2, by="ID") %>%
mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
mutate(condition = ifelse(is.na(condition),T,condition)) %>%
filter(condition) %>%
select(-condition)
First create a new column that matches the condition, and then change the other columns to 0
or NA
according to that condition. Lastly, unselect the new generated column. (Notice that ifelse
changes the class of Date
to numeric
so an as.Date
is needed)
ID Location startdate enddate N Loss_Date Amt
1 A1 012A 2014-11-01 2014-12-31 2 2014-11-15 2200
2 A2 234B 2014-01-01 2014-08-31 0 <NA> 0
3 A3 012A 2015-10-01 2015-12-31 0 <NA> 0
4 A4 238C 2015-01-01 2015-12-31 2 2015-11-30 50
Upvotes: 3
Reputation: 16121
I've just added a little bit of extra code in @VincentBoned 's answer.
# create 1st dataframe
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)
# create 2nd dataframe
ID<-c("A1", "A1", "A4")
N<- c(2,1,2)
Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
Amt<-c("2200","1000", "500")
df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)
library(dplyr)
full_join(df1, df2, by="ID") %>%
mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
select(-condition) %>%
group_by(ID) %>% # for each ID
mutate(Nrows = n()) %>% # count how many rows they have in the final table
ungroup() %>%
filter(!(Nrows > 1 & is.na(Loss_Date))) %>% # filter out rows with IDs that have more than 1 rows and those rows are not matched
select(-Nrows)
# ID Location startdate enddate N Loss_Date Amt
# 1 A1 012A 2014-11-01 2014-12-31 2 2014-11-15 2200
# 2 A2 234B 2014-01-01 2014-08-31 0 <NA> 0
# 3 A3 012A 2015-10-01 2015-12-31 0 <NA> 0
# 4 A4 238C 2015-01-01 2015-12-31 2 2015-11-30 500
If you understood how the above code works (step by step) you can use a more compact version that returns the same result:
full_join(df1, df2, by="ID") %>%
mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate),
N = ifelse(condition & !is.na(condition), N, 0),
Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01"),
Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
group_by(ID) %>%
mutate(Nrows = n()) %>%
filter(!(Nrows > 1 & is.na(Loss_Date))) %>%
select(-c(condition, Nrows))
Upvotes: 3