Adi
Adi

Reputation: 55

R: Merge 2 Data Frame by Multiple Condition Using Dates & ID

I am trying to merge 2 data frame using multiple conditions and have used merge command but unable to get a successful output.

#Data Frame df1#
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
df1<- data.frame(ID,Location, startdate, enddate)

#Data Frame df2#
ID<-c("A1", "A1", "A4")
N<- c(2,1,2)
Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
Amt<-c("2200","1000", "500")
df2<- data.frame(ID, N, Loss_Date,Amt)

I want to merge these 2 dataframe by using Location as common column and Loss_Date in df2 lies between (inclusive) Start_Date and End_Date in df2. You can see that second entry in df2 doesnt get mapped as the date is not in range of df1

#Required Output
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))
N<-c(2,0,0,2)
Loss_Date <- c("2014-11-15", "NA", "NA", "2015-11-30")
Amt<-c("2200","0","0", "500")
Output<- data.frame(ID,Location, startdate, enddate,N, Loss_Date,Amt)

I created a common ID using the Year and ID but get the wrong mapping. Tried various ways to use merge and match but the command doesn work. I need this to run on over 170K observations. Both data frame are of unequal length. Any help would be really appreciated.

Upvotes: 4

Views: 10562

Answers (4)

Arun
Arun

Reputation: 118779

In the current development version of data.table (v1.9.7), non-equi joins are implemented. With that we can do:

require(data.table) # v1.9.7+
setDT(df2)[df1, .(ID, Location, startdate, enddate, N, x.Loss_Date, Amt), 
                      on=.(ID, Loss_Date>=startdate, Loss_Date<=enddate)]
#    ID Location  startdate    enddate  N x.Loss_Date  Amt
# 1: A1     012A 2014-11-01 2014-12-31  2  2014-11-15 2200
# 2: A2     234B 2014-01-01 2014-08-31 NA        <NA>   NA
# 3: A3     012A 2015-10-01 2015-12-31 NA        <NA>   NA
# 4: A4     238C 2015-01-01 2015-12-31  2  2015-11-30  500

Upvotes: 4

Wael Hussein
Wael Hussein

Reputation: 155

sqldf is very robust and easy to read. Check this code out:

library(sqldf)
Output<-sqldf("
           SELECT L.*, r.N, r.Loss_Date, r.Amt
           FROM df1 as L
           LEFT JOIN df2 as r
           ON 
           L.ID=r.ID AND
              r.Loss_Date BETWEEN L.startdate AND L.enddate
           ORDER BY L.ID")

where "L" represents df1 (that's the df1 as l), and "r" represents df2 (df2 as r).

Upvotes: 1

eivicent
eivicent

Reputation: 273

I've done the merge using package dplyr which is really fast and easy to use.

You should add to your data frame definition this stringsAsFactors=F

 df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)
 df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)

So your character inputs don't get changed to factors and they don't give you undesired results

install.packages("dplyr")
library(dplyr)

output <- full_join(df1, df2, by="ID") %>% 
filter(Loss_Date >= startdate & Loss_Date <= enddate)

Output:

  ID Location  startdate    enddate N  Loss_Date  Amt
1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200
2 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  500

Again, as appointed by the comments, if you want to preserve the rows that doesn't match the condition, you should use another function:

output2 <- left_join(df1, df2, by="ID") %>% 
 mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
 mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
 mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
 mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
 mutate(condition = ifelse(is.na(condition),T,condition)) %>%
 filter(condition) %>%
 select(-condition)

First create a new column that matches the condition, and then change the other columns to 0or NA according to that condition. Lastly, unselect the new generated column. (Notice that ifelse changes the class of Date to numericso an as.Date is needed)

  ID Location  startdate    enddate N  Loss_Date  Amt
1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200
2 A2     234B 2014-01-01 2014-08-31 0       <NA>    0
3 A3     012A 2015-10-01 2015-12-31 0       <NA>    0
4 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  50

Upvotes: 3

AntoniosK
AntoniosK

Reputation: 16121

I've just added a little bit of extra code in @VincentBoned 's answer.

# create 1st dataframe
ID<- c("A1", "A2","A3", "A4")
Location <- c("012A","234B","012A","238C" )
startdate <- as.Date(c("2014-11-01","2014-01-01","2015-10-01", "2015-01-01"))
enddate <- as.Date(c("2014-12-31","2014-08-31","2015-12-31","2015-12-31"))

df1<- data.frame(ID,Location, startdate, enddate, stringsAsFactors = F)


# create 2nd dataframe
ID<-c("A1", "A1", "A4")
N<- c(2,1,2)
Loss_Date <- as.Date(c("2014-11-15", "2015-12-25", "2015-11-30"))
Amt<-c("2200","1000", "500")

df2<- data.frame(ID, N, Loss_Date,Amt, stringsAsFactors = F)


library(dplyr)

full_join(df1, df2, by="ID") %>% 
  mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate)) %>%
  mutate(N = ifelse(condition & !is.na(condition), N, 0)) %>%
  mutate(Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01")) %>%
  mutate(Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
  select(-condition) %>%
  group_by(ID) %>%                              # for each ID
  mutate(Nrows = n()) %>%                       # count how many rows they have in the final table
  ungroup() %>%
  filter(!(Nrows > 1 & is.na(Loss_Date))) %>%   # filter out rows with IDs that have more than 1 rows and those rows are not matched
  select(-Nrows)

#   ID Location  startdate    enddate N  Loss_Date  Amt 
# 1 A1     012A 2014-11-01 2014-12-31 2 2014-11-15 2200 
# 2 A2     234B 2014-01-01 2014-08-31 0       <NA>    0 
# 3 A3     012A 2015-10-01 2015-12-31 0       <NA>    0 
# 4 A4     238C 2015-01-01 2015-12-31 2 2015-11-30  500 

If you understood how the above code works (step by step) you can use a more compact version that returns the same result:

full_join(df1, df2, by="ID") %>% 
  mutate(condition = (Loss_Date >= startdate & Loss_Date <= enddate),
         N = ifelse(condition & !is.na(condition), N, 0),
         Loss_Date = as.Date(ifelse(condition, Loss_Date, NA),origin="1970-01-01"),
         Amt = ifelse(condition & !is.na(condition), Amt, 0)) %>%
  group_by(ID) %>%                             
  mutate(Nrows = n()) %>%                      
  filter(!(Nrows > 1 & is.na(Loss_Date))) %>%
  select(-c(condition, Nrows))

Upvotes: 3

Related Questions