Reputation: 215
I have two df I want to join
The first is a df with an employee's work schedule, with corresponding dates when the schedule was effective (effective date)
>df1 = schedule and effective date
ID Date Wrk_Schd
0001 8/16/2002 80.00
0001 2/27/2004 40.00
0001 2/1/2006 50.00
0001 7/1/2017 36.00
The second is a df with pay period dates and actual hours worked
>df2 = pay periods and actual hours
ID Date Wrk_Hrs
0001 9/9/2003 32.00
0001 10/8/2005 35.00
0001 10/21/2006 35.00
0001 12/21/2007 35.00
0001 9/9/2012 40.00
0001 10/9/2013 40.00
0001 12/9/2017 36.00
0001 12/21/2017 36.00
How would I join on ID and date so that the df with actual hours worked matches the work schedule for the applicable effective date?
Keep in mind the dates in df1 and df2 are not exactly the same. So the solution I am looking for would join based on whether the pay period was after the effective date, on the condition that there is not another effective date that might apply.
The desired result is as follows
>df3
ID Date Wrk_Hrs Wrk_Schd
0001 9/9/2003 32.00 80.00
0001 10/8/2005 35.00 40.00
0001 10/21/2006 35.00 50.00
0001 12/21/2007 35.00 50.00
0001 9/9/2012 40.00 50.00
0001 10/9/2013 40.00 50.00
0001 12/9/2017 36.00 36.00
0001 12/21/2017 36.00 36.00
Upvotes: 3
Views: 284
Reputation: 691
This code will do the trick. I changed df1$Date to df1$start.date, so the code will be clearer.
library(lubridate)
df1 <- data.frame(ID=c(1, 1, 1, 1),
start.date=c("8/16/2003", "2/27/2004",
"2/1/2006", "7/1/2017"),
Wrk_Schd=c(80.00, 40.00, 50.00, 36.00))
df2 <- data.frame(ID=c(1, 1, 1, 1, 1, 1, 1, 1),
Date=c("9/9/2003", "10/8/2005",
"10/21/2006", "12/21/2007",
"9/9/2012", "10/9/2013",
"12/9/2017", "12/21/2017"),
Wrk_Hrs=c(32.00, 35.00, 35.00, 35.00,
40.00, 40.00, 36.00, 36.00))
df1$start.date <- as.Date(df1$start.date, "%m/%d/%Y")
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")
## These lines are just to expand the data to have
## more than one ID so the code can be better tested
## They can be discarding without affecting the code
df1 <- rbind(df1,data.frame(ID=df1$ID+1, start.date=df1$start.date+1, Wrk_Schd=df1$Wrk_Schd+10))
df2 <- rbind(df2,data.frame(ID=df2$ID+1, Date=df2$Date+1, Wrk_Hrs=df2$Wrk_Hrs+1))
## order and set end of periods
df1 <- df1[order(df1$ID,df1$start.date),] #order data
df1$end.date <- c(df1$start.date[-1]-1, today())
df1$end.date[df1$ID!=c(df1$ID[-1],df1$ID[1])] <- today() #set end of periods
## ID start.date Wrk_Schd end.date
##1 1 2003-08-16 80 2004-02-26
##2 1 2004-02-27 40 2006-01-31
##3 1 2006-02-01 50 2017-06-30
##4 1 2017-07-01 36 2018-01-27
##5 2 2003-08-17 90 2004-02-27
##6 2 2004-02-28 50 2006-02-01
##7 2 2006-02-02 60 2017-07-01
##8 2 2017-07-02 46 2018-01-27
## Assing Wrk_Schd to each of the jobs
df2$Wrk_Schd <-apply(df2, 1, function(x)(
df1$Wrk_Schd[which((df1$start.date<x["Date"])&
(df1$end.date>x["Date"])&
(df1$ID==x["ID"]))]
))
## ID Date Wrk_Hrs Wrk_Schd
##1 1 2003-09-09 32 80
##2 1 2005-10-08 35 40
##3 1 2006-10-21 35 50
##4 1 2007-12-21 35 50
##5 1 2012-09-09 40 50
##6 1 2013-10-09 40 50
##7 1 2017-12-09 36 36
##8 1 2017-12-21 36 36
##9 2 2003-09-10 33 90
##10 2 2005-10-09 36 50
##11 2 2006-10-22 36 60
##12 2 2007-12-22 36 60
##13 2 2012-09-10 41 60
##14 2 2013-10-10 41 60
##15 2 2017-12-10 37 46
##16 2 2017-12-22 37 46
I will also suggest another solution that may save a further step for you: Summing up the works which belong to a period.
df1$Total_Wrk_Hrs <- apply(df1,1,function(x)(
sum(df2$Wrk_Hrs[which((x["start.date"]<df2$Date)&
(x["end.date"]>df2$Date)&
(x["ID"]==df2$ID))]
)
))
## ID start.date Wrk_Schd end.date Total_Wrk_Hrs
##1 1 2003-08-16 80 2004-02-26 32
##2 1 2004-02-27 40 2006-01-31 35
##3 1 2006-02-01 50 2017-06-30 150
##4 1 2017-07-01 36 2018-01-27 72
##5 2 2003-08-17 90 2004-02-27 33
##6 2 2004-02-28 50 2006-02-01 36
##7 2 2006-02-02 60 2017-07-01 154
##8 2 2017-07-02 46 2018-01-27 74
Finally, I noticed you are not using Date format. You should convert it to be able compare dates. You can do it by running this code in every date column which is still not a date
df2$Date <- as.Date(df2$Date , format="%m/%d/%Y")
class(df2$Date) #check if is date
#[1] "Date"
Also, you should always provide reproducible codes. This means that by simply coping your code and running, I should be able to reproduce the error. Take a look at How to make a great R reproducible example? for more on this. By observing this, you increase (a lot) the changes of having a reply.
Upvotes: 1
Reputation: 20095
One possible solution could be using dplyr
and sqldf
.
# The data
df1 <- read.table(text = "ID Date Wrk_Schd
0001 08/16/2003 80.00
0001 02/27/2004 40.00
0001 02/01/2006 50.00
0001 07/01/2017 36.00", header = TRUE, stringsAsFactors = FALSE)
# Change Date column to date type
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")
df2 <- read.table(text = "ID Date Wrk_Hrs
0001 09/09/2003 32.00
0001 10/08/2005 35.00
0001 10/21/2006 35.00
0001 12/21/2007 35.00
0001 09/09/2012 40.00
0001 10/09/2013 40.00
0001 12/09/2017 36.00
0001 12/21/2017 36.00", header = TRUE, stringsAsFactors = FALSE)
# Change Date column to date type
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")
library(dplyr)
library(sqldf)
# Use lead function to add a column that show previous day of the next
schedule date
df1_Mod <- df1 %>%
arrange(ID, Date) %>%
group_by(ID) %>%
mutate(End_Date = lead(Date) - 1)
df1_Mod
# ID Date Wrk_Schd End_Date
#1 1 2003-08-16 80 2004-02-26
#2 1 2004-02-27 40 2006-01-31
#3 1 2006-02-01 50 2017-06-30
#4 1 2017-07-01 36 <NA>
#Join data.frames based on ID and Date between Date and End_Date
df3 <- sqldf("SELECT df2.ID, df2.Date, df2.Wrk_Hrs, df1_Mod.Wrk_Schd
FROM df2, df1_Mod
WHERE df2.ID = df1_Mod.ID AND
df2.Date >= df1_Mod.Date AND
(df1_Mod.End_Date IS NULL OR df2.Date <= df1_Mod.End_Date)")
df3
# ID Date Wrk_Hrs Wrk_Schd
#1 1 2003-09-09 32 80
#2 1 2005-10-08 35 40
#3 1 2006-10-21 35 50
#4 1 2007-12-21 35 50
#5 1 2012-09-09 40 50
#6 1 2013-10-09 40 50
#7 1 2017-12-09 36 36
#8 1 2017-12-21 36 36
Upvotes: 4
Reputation: 79228
You will have to create a new column that has both the effective and period dates and use that to merge the two dfs, then you can remove the columns if you want to
df1$Date=as.Date(df1$Date,format="%m/%d/%Y")
df2$Date=as.Date(df2$Date,format="%m/%d/%Y")
s=colSums(outer(df1$Date,df2$Date,`<`))
df3=df1[s,]
df3$d=df2$d=paste(df1$Date[s],df2$Date)
df=merge(df2,df3[-2])
df[order(df$Date),-2]
ID Date Wrk_Hrs Wrk_Schd
1 1 2003-09-09 32 80
2 1 2005-10-08 35 40
3 1 2006-10-21 35 50
4 1 2007-12-21 35 50
5 1 2012-09-09 40 50
6 1 2013-10-09 40 50
7 1 2017-12-09 36 36
8 1 2017-12-21 36 36
Upvotes: 0