samuelt
samuelt

Reputation: 215

Joining data frames on unique id and conditional dates

I have two df I want to join

The first is a df with an employee's work schedule, with corresponding dates when the schedule was effective (effective date)

>df1 = schedule and effective date
ID      Date        Wrk_Schd
0001    8/16/2002   80.00
0001    2/27/2004   40.00
0001    2/1/2006    50.00
0001    7/1/2017    36.00

The second is a df with pay period dates and actual hours worked

>df2 = pay periods and actual hours
ID      Date        Wrk_Hrs
0001    9/9/2003    32.00
0001    10/8/2005   35.00   
0001    10/21/2006  35.00
0001    12/21/2007  35.00   
0001    9/9/2012    40.00
0001    10/9/2013   40.00
0001    12/9/2017   36.00
0001    12/21/2017  36.00

How would I join on ID and date so that the df with actual hours worked matches the work schedule for the applicable effective date?

Keep in mind the dates in df1 and df2 are not exactly the same. So the solution I am looking for would join based on whether the pay period was after the effective date, on the condition that there is not another effective date that might apply.

The desired result is as follows

>df3 
ID      Date        Wrk_Hrs Wrk_Schd
0001    9/9/2003    32.00   80.00
0001    10/8/2005   35.00   40.00
0001    10/21/2006  35.00   50.00
0001    12/21/2007  35.00   50.00
0001    9/9/2012    40.00   50.00
0001    10/9/2013   40.00   50.00
0001    12/9/2017   36.00   36.00
0001    12/21/2017  36.00   36.00

Upvotes: 3

Views: 284

Answers (3)

rvbarreto
rvbarreto

Reputation: 691

This code will do the trick. I changed df1$Date to df1$start.date, so the code will be clearer.

library(lubridate)
df1 <- data.frame(ID=c(1, 1, 1, 1),
                  start.date=c("8/16/2003", "2/27/2004",
                               "2/1/2006", "7/1/2017"),
                  Wrk_Schd=c(80.00, 40.00, 50.00, 36.00))

df2 <- data.frame(ID=c(1, 1, 1, 1, 1, 1, 1, 1),
                  Date=c("9/9/2003", "10/8/2005",
                         "10/21/2006", "12/21/2007",
                         "9/9/2012", "10/9/2013",
                         "12/9/2017", "12/21/2017"),
                  Wrk_Hrs=c(32.00, 35.00, 35.00, 35.00,
                            40.00, 40.00, 36.00, 36.00))

df1$start.date <- as.Date(df1$start.date, "%m/%d/%Y")
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")

## These lines are just to expand the data to have
## more than one ID so the code can be better tested
## They can be discarding without affecting the code
df1 <- rbind(df1,data.frame(ID=df1$ID+1, start.date=df1$start.date+1, Wrk_Schd=df1$Wrk_Schd+10))
df2 <- rbind(df2,data.frame(ID=df2$ID+1, Date=df2$Date+1, Wrk_Hrs=df2$Wrk_Hrs+1))

## order and set end of periods
df1 <- df1[order(df1$ID,df1$start.date),] #order data
df1$end.date <- c(df1$start.date[-1]-1, today())
df1$end.date[df1$ID!=c(df1$ID[-1],df1$ID[1])] <- today() #set end of periods

##  ID start.date Wrk_Schd   end.date
##1  1 2003-08-16       80 2004-02-26
##2  1 2004-02-27       40 2006-01-31
##3  1 2006-02-01       50 2017-06-30
##4  1 2017-07-01       36 2018-01-27
##5  2 2003-08-17       90 2004-02-27
##6  2 2004-02-28       50 2006-02-01
##7  2 2006-02-02       60 2017-07-01
##8  2 2017-07-02       46 2018-01-27

## Assing Wrk_Schd to each of the jobs
df2$Wrk_Schd <-apply(df2, 1, function(x)(
    df1$Wrk_Schd[which((df1$start.date<x["Date"])&
                       (df1$end.date>x["Date"])&
                       (df1$ID==x["ID"]))]
))

##   ID       Date Wrk_Hrs Wrk_Schd
##1   1 2003-09-09      32       80
##2   1 2005-10-08      35       40
##3   1 2006-10-21      35       50
##4   1 2007-12-21      35       50
##5   1 2012-09-09      40       50
##6   1 2013-10-09      40       50
##7   1 2017-12-09      36       36
##8   1 2017-12-21      36       36
##9   2 2003-09-10      33       90
##10  2 2005-10-09      36       50
##11  2 2006-10-22      36       60
##12  2 2007-12-22      36       60
##13  2 2012-09-10      41       60
##14  2 2013-10-10      41       60
##15  2 2017-12-10      37       46
##16  2 2017-12-22      37       46

I will also suggest another solution that may save a further step for you: Summing up the works which belong to a period.

df1$Total_Wrk_Hrs <- apply(df1,1,function(x)(
    sum(df2$Wrk_Hrs[which((x["start.date"]<df2$Date)&
                          (x["end.date"]>df2$Date)&
                          (x["ID"]==df2$ID))]
        )
))

##  ID start.date Wrk_Schd   end.date Total_Wrk_Hrs
##1  1 2003-08-16       80 2004-02-26         32
##2  1 2004-02-27       40 2006-01-31         35
##3  1 2006-02-01       50 2017-06-30        150
##4  1 2017-07-01       36 2018-01-27         72
##5  2 2003-08-17       90 2004-02-27         33
##6  2 2004-02-28       50 2006-02-01         36
##7  2 2006-02-02       60 2017-07-01        154
##8  2 2017-07-02       46 2018-01-27         74 

Finally, I noticed you are not using Date format. You should convert it to be able compare dates. You can do it by running this code in every date column which is still not a date

df2$Date <- as.Date(df2$Date , format="%m/%d/%Y")
class(df2$Date) #check if is date
#[1] "Date"

Also, you should always provide reproducible codes. This means that by simply coping your code and running, I should be able to reproduce the error. Take a look at How to make a great R reproducible example? for more on this. By observing this, you increase (a lot) the changes of having a reply.

Upvotes: 1

MKR
MKR

Reputation: 20095

One possible solution could be using dplyr and sqldf.

# The data
df1 <- read.table(text = "ID      Date        Wrk_Schd
0001    08/16/2003   80.00
0001    02/27/2004   40.00
0001    02/01/2006    50.00
0001    07/01/2017    36.00", header = TRUE, stringsAsFactors = FALSE)

# Change Date column to date type
df1$Date <- as.Date(df1$Date, "%m/%d/%Y")

df2 <- read.table(text = "ID      Date        Wrk_Hrs
0001    09/09/2003    32.00
0001    10/08/2005   35.00   
0001    10/21/2006  35.00
0001    12/21/2007  35.00   
0001    09/09/2012    40.00
0001    10/09/2013   40.00
0001    12/09/2017   36.00
0001    12/21/2017  36.00", header = TRUE, stringsAsFactors = FALSE)

# Change Date column to date type 
df2$Date <- as.Date(df2$Date, "%m/%d/%Y")


library(dplyr)
library(sqldf)
# Use lead function to add a column that show previous day of the next
  schedule date
df1_Mod <- df1 %>%
  arrange(ID, Date) %>%
  group_by(ID) %>%
  mutate(End_Date = lead(Date) - 1)

df1_Mod
#  ID       Date Wrk_Schd   End_Date
#1  1 2003-08-16       80 2004-02-26
#2  1 2004-02-27       40 2006-01-31
#3  1 2006-02-01       50 2017-06-30
#4  1 2017-07-01       36       <NA>

#Join data.frames based on ID and Date between Date and End_Date

df3 <- sqldf("SELECT df2.ID, df2.Date, df2.Wrk_Hrs, df1_Mod.Wrk_Schd 
              FROM df2, df1_Mod
              WHERE df2.ID = df1_Mod.ID AND
              df2.Date >= df1_Mod.Date AND
             (df1_Mod.End_Date IS NULL OR df2.Date <= df1_Mod.End_Date)")

df3
#  ID       Date Wrk_Hrs Wrk_Schd
#1  1 2003-09-09      32       80
#2  1 2005-10-08      35       40
#3  1 2006-10-21      35       50
#4  1 2007-12-21      35       50
#5  1 2012-09-09      40       50
#6  1 2013-10-09      40       50
#7  1 2017-12-09      36       36
#8  1 2017-12-21      36       36

Upvotes: 4

Onyambu
Onyambu

Reputation: 79228

You will have to create a new column that has both the effective and period dates and use that to merge the two dfs, then you can remove the columns if you want to

 df1$Date=as.Date(df1$Date,format="%m/%d/%Y")
 df2$Date=as.Date(df2$Date,format="%m/%d/%Y")
 s=colSums(outer(df1$Date,df2$Date,`<`))
 df3=df1[s,]
 df3$d=df2$d=paste(df1$Date[s],df2$Date)
 df=merge(df2,df3[-2])
 df[order(df$Date),-2]
 ID       Date Wrk_Hrs Wrk_Schd
1  1 2003-09-09      32       80
2  1 2005-10-08      35       40
3  1 2006-10-21      35       50
4  1 2007-12-21      35       50
5  1 2012-09-09      40       50
6  1 2013-10-09      40       50
7  1 2017-12-09      36       36
8  1 2017-12-21      36       36

Upvotes: 0

Related Questions