Reputation: 117
I am still learning R and having trouble trying to merge two data sets from two different data.table and match it within the time interval. For example given table1_schedule and table2_schedule:
table1_schedule
Channel Program program_Date start_time
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
table2_watch
Person Channel program_Date start_time end_time
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
The goal is to merge the programs that run between the "start_time" and "end_time" of the table2_watch table and add the programs watched by the person during that time interval each time. For example,
The wanted output
Person Channel program_Date start_time end_time Prog1 Prog2 Prog3
Name A NatGeo 1/1/2018 11:00 12:00 Doc A Doc B Doc C
Name B NatGeo 1/1/2018 12:30 14:00 Doc C Doc D -NA-
Name B HBO 1/1/2018 21:30 22:00 Mov A -NA- -NA-
Name B HBO 1/1/2018 22:30 23:30 Mov A Mov B -NA-
Is there a way to do this in the simplest and most efficient way such as using dplyr
or any other R commands best for this type of problem? And add the watched programs during the time interval only if it goes beyond 10 minutes then add that the person watched the next program. Thanks
Upvotes: 4
Views: 1511
Reputation: 50678
Here is a data.table
solution where we can make use foverlap
.
I'm showing every step with a short comment, to hopefully help with understanding.
library(data.table)
# Convert date & time to POSIXct
# Note that foverlap requires a start and end date, so we create an end date
# from the next start date per channel using shift for df1
setDT(df1)[, `:=`(
time1 = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
time2 = as.POSIXct(paste(program_Date, shift(start_time, 1, type = "lead", fill = start_time[.N])), format = "%d/%m/%Y %H:%M")), by = Channel]
setDT(df2)[, `:=`(
start = as.POSIXct(paste(program_Date, start_time), format = "%d/%m/%Y %H:%M"),
end = as.POSIXct(paste(program_Date, end_time), format = "%d/%m/%Y %H:%M"))]
# Remove unnecessary columns in preparation for final output
df1[, `:=`(program_Date = NULL, start_time = NULL)]
df2[, `:=`(program_Date = NULL, start_time = NULL, end_time = NULL)]
# Join on channel and overlapping intervals
# Once joined, remove time1 and time2
setkey(df1, Channel, time1, time2)
dt <- foverlaps(df2, df1, by.x = c("Channel", "start", "end"), nomatch = 0L)
dt[, `:=`(time1 = NULL, time2 = NULL)]
# Spread long to wide
dt[, idx := paste0("Prog",1:.N), by = c("Channel", "Person", "start")]
dcast(dt, Channel + Person + start + end ~ idx, value.var = "Program")[order(Person, start)]
# Channel Person start end Prog1 Prog2 Prog3
#1: NatGeo Name A 2018-01-01 11:00:00 2018-01-01 12:00:00 Doc A Doc B Doc C
#2: NatGeo Name B 2018-01-01 12:30:00 2018-01-01 14:00:00 Doc C Doc D NA
#3: HBO Name B 2018-01-01 21:30:00 2018-01-01 22:00:00 Mov A NA NA
#4: HBO Name B 2018-01-01 22:30:00 2018-01-01 23:30:00 Mov A Mov B NA
df1 <- read.table(text =
"Channel Program program_Date start_time
HBO 'Mov A' 1/1/2018 21:00
HBO 'Mov B' 1/1/2018 23:00
HBO 'Mov C' 1/1/2018 23:59
NatGeo 'Doc A' 1/1/2018 11:00
NatGeo 'Doc B' 1/1/2018 11:30
NatGeo 'Doc C' 1/1/2018 12:00
NatGeo 'Doc D' 1/1/2018 14:00", header = T)
df2 <- read.table(text =
"Person Channel program_Date start_time end_time
'Name A' NatGeo 1/1/2018 11:00 12:00
'Name B' NatGeo 1/1/2018 12:30 14:00
'Name B' HBO 1/1/2018 21:30 22:00
'Name B' HBO 1/1/2018 22:30 23:30", header = T)
Upvotes: 2
Reputation: 1666
Here is how I would go about doing this. Note that I renamed some of your stuff.
> cat schedule
Channel Program Date StartTime
HBO Mov A 1/1/2018 21:00
HBO Mov B 1/1/2018 23:00
HBO Mov C 1/1/2018 23:59
NatGeo Doc A 1/1/2018 11:00
NatGeo Doc B 1/1/2018 11:30
NatGeo Doc C 1/1/2018 12:00
NatGeo Doc D 1/1/2018 14:00
> cat watch
Person Channel Date StartTime EndTime
Name A NatGeo 1/1/2018 11:00 12:00
Name B NatGeo 1/1/2018 12:30 14:00
Name B HBO 1/1/2018 21:30 22:00
Name B HBO 1/1/2018 22:30 23:30
Now, make sure we read these correctly using readr
. In other words, specify the correct formats for the dates and the times.
library(dplyr)
library(readr)
library(lubridate)
schedule <- read_table("schedule",
col_types=cols_only(Channel=col_character(),
Program=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M")))
watch <- read_table("watch",
col_types=cols_only(Person=col_character(),
Channel=col_character(),
Date=col_date("%d/%m/%Y"),
StartTime=col_time("%H:%M"),
EndTime=col_time("%H:%M")))
Next, we convert all dates and times to datetimes and add an ending datetime to the schedule.
schedule <- schedule %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
group_by(Channel) %>%
mutate(EndDateTime=lead(StartDateTime, default=as_datetime(Inf))) %>%
ungroup() %>%
select(Channel, Program, StartDateTime, EndDateTime)
watch <- watch %>%
mutate(StartDateTime=ymd_hms(paste(Date, StartTime))) %>%
mutate(EndDateTime=ymd_hms(paste(Date, EndTime))) %>%
select(Person, Channel, StartDateTime, EndDateTime)
We can perform a join and check if the watch and schedule intervals overlap (you can modify this to accommodate to your 10 minute comment I believe, although I did not fully understand what you meant).
watch %>%
inner_join(schedule,
by=c("Channel" = "Channel"),
suffix=c(".Watch", ".Schedule")) %>%
filter(int_overlaps(interval(StartDateTime.Watch, EndDateTime.Watch),
interval(StartDateTime.Schedule, EndDateTime.Schedule))) %>%
select(Person, Channel, Program, StartDateTime.Watch, EndDateTime.Watch) %>%
rename_at(.vars=vars(ends_with(".Watch")),
.funs=funs(sub("\\.Watch$", "", .)))
# A tibble: 8 x 5
Person Channel Program StartDateTime EndDateTime
<chr> <chr> <chr> <dttm> <dttm>
1 Name A NatGeo Doc A 2018-01-01 11:00:00 2018-01-01 12:00:00
2 Name A NatGeo Doc B 2018-01-01 11:00:00 2018-01-01 12:00:00
3 Name A NatGeo Doc C 2018-01-01 11:00:00 2018-01-01 12:00:00
4 Name B NatGeo Doc C 2018-01-01 12:30:00 2018-01-01 14:00:00
5 Name B NatGeo Doc D 2018-01-01 12:30:00 2018-01-01 14:00:00
6 Name B HBO Mov A 2018-01-01 21:30:00 2018-01-01 22:00:00
7 Name B HBO Mov A 2018-01-01 22:30:00 2018-01-01 23:30:00
8 Name B HBO Mov B 2018-01-01 22:30:00 2018-01-01 23:30:00
To get the desired output, you would have to group by everything except Program
and "explode" the resulting groups into multiple columns. However, I am not sure if that is a good idea so I did not do it.
Upvotes: 1