governator5410
governator5410

Reputation: 117

R- merging two data sets within time duration/intervals

I am still learning R and having trouble trying to merge two data sets from two different data.table and match it within the time interval. For example given table1_schedule and table2_schedule:

table1_schedule

Channel    Program      program_Date    start_time
HBO        Mov A        1/1/2018        21:00
HBO        Mov B        1/1/2018        23:00
HBO        Mov C        1/1/2018        23:59
NatGeo     Doc A        1/1/2018        11:00
NatGeo     Doc B        1/1/2018        11:30
NatGeo     Doc C        1/1/2018        12:00
NatGeo     Doc D        1/1/2018        14:00

table2_watch

Person    Channel        program_Date       start_time    end_time
Name A    NatGeo             1/1/2018        11:00        12:00
Name B    NatGeo             1/1/2018        12:30        14:00         
Name B    HBO                1/1/2018        21:30        22:00
Name B    HBO                1/1/2018        22:30        23:30

The goal is to merge the programs that run between the "start_time" and "end_time" of the table2_watch table and add the programs watched by the person during that time interval each time. For example,

The wanted output

  Person    Channel   program_Date  start_time  end_time  Prog1  Prog2  Prog3
Name A    NatGeo      1/1/2018      11:00       12:00     Doc A  Doc B  Doc C       
Name B    NatGeo      1/1/2018      12:30       14:00     Doc C  Doc D  -NA- 
Name B    HBO         1/1/2018      21:30       22:00     Mov A  -NA-   -NA- 
Name B    HBO         1/1/2018      22:30       23:30     Mov A  Mov B  -NA-

Is there a way to do this in the simplest and most efficient way such as using dplyr or any other R commands best for this type of problem? And add the watched programs during the time interval only if it goes beyond 10 minutes then add that the person watched the next program. Thanks

Upvotes: 4

Views: 1511

Answers (2)

Maurits Evers
Maurits Evers

Reputation: 50678

Here is a data.table solution where we can make use foverlap.

I'm showing every step with a short comment, to hopefully help with understanding.

library(data.table)

# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
    time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
    start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
    end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]

# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]

# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]

# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
#   Channel Person               start                 end Prog1 Prog2 Prog3
#1:  NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2:  NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D    NA
#3:     HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A    NA    NA
#4:     HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B    NA

Sample data

df1 <- read.table(text =
    "Channel    Program      program_Date    start_time
HBO        'Mov A'        1/1/2018        21:00
HBO        'Mov B'        1/1/2018        23:00
HBO        'Mov C'        1/1/2018        23:59
NatGeo     'Doc A'        1/1/2018        11:00
NatGeo     'Doc B'        1/1/2018        11:30
NatGeo     'Doc C'        1/1/2018        12:00
NatGeo     'Doc D'        1/1/2018        14:00", header = T)


df2 <- read.table(text =
    "Person    Channel        program_Date       start_time    end_time
'Name A'    NatGeo             1/1/2018        11:00        12:00
'Name B'    NatGeo             1/1/2018        12:30        14:00
'Name B'    HBO                1/1/2018        21:30        22:00
'Name B'    HBO                1/1/2018        22:30        23:30", header = T)

Upvotes: 2

d125q
d125q

Reputation: 1666

Here is how I would go about doing this. Note that I renamed some of your stuff.

> cat schedule 
Channel    Program      Date            StartTime
HBO        Mov A        1/1/2018        21:00
HBO        Mov B        1/1/2018        23:00
HBO        Mov C        1/1/2018        23:59
NatGeo     Doc A        1/1/2018        11:00
NatGeo     Doc B        1/1/2018        11:30
NatGeo     Doc C        1/1/2018        12:00
NatGeo     Doc D        1/1/2018        14:00
> cat watch
Person    Channel            Date        StartTime      EndTime
Name A    NatGeo             1/1/2018        11:00        12:00
Name B    NatGeo             1/1/2018        12:30        14:00         
Name B    HBO                1/1/2018        21:30        22:00
Name B    HBO                1/1/2018        22:30        23:30

Now, make sure we read these correctly using readr. In other words, specify the correct formats for the dates and the times.

library(dplyr)
library(readr)
library(lubridate)

schedule <- read_table("schedule",
                       col_types=cols_only(Channel=col_character(),
                                           Program=col_character(),
                                           Date=col_date("%d/%m/%Y"),
                                           StartTime=col_time("%H:%M")))

watch <- read_table("watch",
                    col_types=cols_only(Person=col_character(),
                                        Channel=col_character(),
                                        Date=col_date("%d/%m/%Y"),
                                        StartTime=col_time("%H:%M"),
                                        EndTime=col_time("%H:%M")))

Next, we convert all dates and times to datetimes and add an ending datetime to the schedule.

schedule <- schedule %>%
    mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
    group_by(Channel) %>%
    mutate(EndDateTime=lead(StartDateTime, default=as_datetime(Inf))) %>%
    ungroup() %>%
    select(Channel, Program, StartDateTime, EndDateTime)

watch <- watch %>%
    mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
    mutate(EndDateTime=ymd_hms(paste(Date, EndTime))) %>%
    select(Person, Channel, StartDateTime, EndDateTime)

We can perform a join and check if the watch and schedule intervals overlap (you can modify this to accommodate to your 10 minute comment I believe, although I did not fully understand what you meant).

watch %>%
    inner_join(schedule,
               by=c("Channel" = "Channel"),
               suffix=c(".Watch", ".Schedule")) %>%
    filter(int_overlaps(interval(StartDateTime.Watch, EndDateTime.Watch),
                        interval(StartDateTime.Schedule, EndDateTime.Schedule))) %>%
    select(Person, Channel, Program, StartDateTime.Watch, EndDateTime.Watch) %>%
    rename_at(.vars=vars(ends_with(".Watch")),
              .funs=funs(sub("\\.Watch$", "", .)))
# A tibble: 8 x 5
  Person Channel Program StartDateTime       EndDateTime        
  <chr>  <chr>   <chr>   <dttm>              <dttm>             
1 Name A NatGeo  Doc A   2018-01-01 11:00:00 2018-01-01 12:00:00
2 Name A NatGeo  Doc B   2018-01-01 11:00:00 2018-01-01 12:00:00
3 Name A NatGeo  Doc C   2018-01-01 11:00:00 2018-01-01 12:00:00
4 Name B NatGeo  Doc C   2018-01-01 12:30:00 2018-01-01 14:00:00
5 Name B NatGeo  Doc D   2018-01-01 12:30:00 2018-01-01 14:00:00
6 Name B HBO     Mov A   2018-01-01 21:30:00 2018-01-01 22:00:00
7 Name B HBO     Mov A   2018-01-01 22:30:00 2018-01-01 23:30:00
8 Name B HBO     Mov B   2018-01-01 22:30:00 2018-01-01 23:30:00

To get the desired output, you would have to group by everything except Program and "explode" the resulting groups into multiple columns. However, I am not sure if that is a good idea so I did not do it.

Upvotes: 1

Related Questions