R- merging two data sets within time duration/intervals

Question

I am still learning R and am having trouble merging two data sets, held in two different data.table objects, by matching rows within a time interval. For example, given table1_schedule and table2_watch:

table1_schedule

Channel    Program      program_Date    start_time
HBO        Mov A        1/1/2018        21:00
HBO        Mov B        1/1/2018        23:00
HBO        Mov C        1/1/2018        23:59
NatGeo     Doc A        1/1/2018        11:00
NatGeo     Doc B        1/1/2018        11:30
NatGeo     Doc C        1/1/2018        12:00
NatGeo     Doc D        1/1/2018        14:00

table2_watch

Person    Channel        program_Date       start_time    end_time
Name A    NatGeo             1/1/2018        11:00        12:00
Name B    NatGeo             1/1/2018        12:30        14:00         
Name B    HBO                1/1/2018        21:30        22:00
Name B    HBO                1/1/2018        22:30        23:30

The goal is to find the programs that run between the "start_time" and "end_time" of each row in table2_watch, and to add the programs the person watched during that interval as new columns. For example,

The wanted output

Person    Channel   program_Date  start_time  end_time  Prog1  Prog2  Prog3
Name A    NatGeo    1/1/2018      11:00       12:00     Doc A  Doc B  Doc C
Name B    NatGeo    1/1/2018      12:30       14:00     Doc C  Doc D  NA
Name B    HBO       1/1/2018      21:30       22:00     Mov A  NA     NA
Name B    HBO       1/1/2018      22:30       23:30     Mov A  Mov B  NA

Is there a simple and efficient way to do this, using dplyr or whatever R tools suit this type of problem best? Also, a program should only be added if the person watched it for more than 10 minutes; that is, only when the viewing interval runs more than 10 minutes into the next program should that program count as watched too. Thanks.

Maurits Evers · Accepted Answer

Here is a data.table solution that makes use of foverlaps.

I'm showing every step with a short comment, to hopefully help with understanding.

library(data.table)

# Convert date & time to POSIXct
# Note that foverlaps requires a start and an end for every interval, so for
# df1 we take each program's end from the next start on the same channel
# (via shift); the last program per channel gets a zero-length interval
setDT(df1)[, `:=`(
    time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    time2 = as.POSIXct(
        paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])),
        format = "%d/%m/%Y %H:%M")),
    by = Channel]
setDT(df2)[, `:=`(
    start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]

# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]

# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]

# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
#   Channel Person               start                 end Prog1 Prog2 Prog3
#1:  NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2:  NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D    NA
#3:     HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A    NA    NA
#4:     HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B    NA
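
The code above does not apply the 10-minute condition from the question. Here is a minimal sketch of one possible reading, assuming a program only counts as watched when the viewing interval overlaps it for at least 10 minutes. The names sched, watch, overlap_min and min_minutes are hypothetical, and the tables are rebuilt inline (with UTC timestamps, also an assumption) so the block runs on its own.

```r
library(data.table)

# Helper (hypothetical): build POSIXct times on the sample date
at <- function(hhmm) as.POSIXct(paste("2018-01-01", hhmm), tz = "UTC")

# Schedule: each program ends when the next one on the same channel starts;
# the last program per channel gets a zero-length interval
sched <- data.table(
    Channel = c("HBO", "HBO", "HBO", "NatGeo", "NatGeo", "NatGeo", "NatGeo"),
    Program = c("Mov A", "Mov B", "Mov C", "Doc A", "Doc B", "Doc C", "Doc D"),
    time1 = at(c("21:00", "23:00", "23:59", "11:00", "11:30", "12:00", "14:00")))
sched[, time2 := shift(time1, 1, type = "lead", fill = time1[.N]), by = Channel]

# Viewing intervals
watch <- data.table(
    Person = c("Name A", "Name B", "Name B", "Name B"),
    Channel = c("NatGeo", "NatGeo", "HBO", "HBO"),
    start = at(c("11:00", "12:30", "21:30", "22:30")),
    end = at(c("12:00", "14:00", "22:00", "23:30")))

# Overlap join on channel and interval, as in the answer
setkey(sched, Channel, time1, time2)
ov <- foverlaps(watch, sched, by.x = c("Channel", "start", "end"), nomatch = 0L)

# Minutes actually shared by the viewing interval and the program slot
ov[, overlap_min := (pmin(as.numeric(end), as.numeric(time2)) -
                     pmax(as.numeric(start), as.numeric(time1))) / 60]

# Assumed threshold: keep programs watched for at least this many minutes
min_minutes <- 10
kept <- ov[overlap_min >= min_minutes]

# Spread long to wide
kept[, idx := paste0("Prog", seq_len(.N)), by = .(Channel, Person, start)]
wide <- dcast(kept, Channel + Person + start + end ~ idx,
              value.var = "Program")[order(Person, start)]
print(wide)
```

Under this reading, programs that merely touch the viewing interval at its boundary (Doc C for Name A, Doc D for Name B) are dropped, so the result differs from the wanted output shown in the question. Adjust min_minutes, or the comparison, if the intended rule is different.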

Sample data

df1 <- read.table(text =
    "Channel    Program      program_Date    start_time
HBO        'Mov A'        1/1/2018        21:00
HBO        'Mov B'        1/1/2018        23:00
HBO        'Mov C'        1/1/2018        23:59
NatGeo     'Doc A'        1/1/2018        11:00
NatGeo     'Doc B'        1/1/2018        11:30
NatGeo     'Doc C'        1/1/2018        12:00
NatGeo     'Doc D'        1/1/2018        14:00", header = TRUE)


df2 <- read.table(text =
    "Person    Channel        program_Date       start_time    end_time
'Name A'    NatGeo             1/1/2018        11:00        12:00
'Name B'    NatGeo             1/1/2018        12:30        14:00
'Name B'    HBO                1/1/2018        21:30        22:00
'Name B'    HBO                1/1/2018        22:30        23:30", header = TRUE)
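
Since the question mentions dplyr, here is a rough equivalent of the overlap join, assuming dplyr 1.1.0 or later (for join_by() with overlaps()) and tidyr are installed. This is a sketch, not part of the answer above: the names sched, watch and wide_tbl are hypothetical, and the tables are rebuilt inline with UTC timestamps so the block is self-contained.

```r
library(dplyr)
library(tidyr)

# Helper (hypothetical): build POSIXct times on the sample date
at <- function(hhmm) as.POSIXct(paste("2018-01-01", hhmm), tz = "UTC")

# Schedule: the end of each program is the next start on the same channel;
# the last program per channel falls back to its own start time
sched <- tibble(
    Channel = c("HBO", "HBO", "HBO", "NatGeo", "NatGeo", "NatGeo", "NatGeo"),
    Program = c("Mov A", "Mov B", "Mov C", "Doc A", "Doc B", "Doc C", "Doc D"),
    time1 = at(c("21:00", "23:00", "23:59", "11:00", "11:30", "12:00", "14:00"))) %>%
    group_by(Channel) %>%
    mutate(time2 = coalesce(lead(time1), time1)) %>%
    ungroup()

# Viewing intervals
watch <- tibble(
    Person = c("Name A", "Name B", "Name B", "Name B"),
    Channel = c("NatGeo", "NatGeo", "HBO", "HBO"),
    start = at(c("11:00", "12:30", "21:30", "22:30")),
    end = at(c("12:00", "14:00", "22:00", "23:30")))

wide_tbl <- watch %>%
    # Same channel, and [start, end] overlaps [time1, time2] (closed bounds)
    inner_join(sched, by = join_by(Channel, overlaps(start, end, time1, time2))) %>%
    # Number the matched programs in broadcast order within each viewing row
    arrange(Person, start, time1) %>%
    group_by(Person, Channel, start) %>%
    mutate(idx = paste0("Prog", row_number())) %>%
    ungroup() %>%
    select(-time1, -time2) %>%
    # Spread long to wide
    pivot_wider(names_from = idx, values_from = Program)

print(wide_tbl)
```

The closed bounds used by overlaps() by default mean that programs touching the viewing interval only at its boundary are matched too, which is also how the foverlaps call in the answer behaves.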
