Reputation: 339
I have a two datatables like below:
DT1
id start_time end_time
604 2017-08-10 18:44:14 2017-08-11 19:33:17
604 2017-08-10 20:38:20 2017-08-11 20:44:44
604 2017-08-10 20:54:26 2017-08-11 20:58:48
604 2017-08-10 21:35:50 2017-08-11 22:03:14
604 2017-08-10 22:05:42 2017-08-11 22:17:12
.......
DT2
id t1 t2
604 2017-08-10 18:40:14 2017-08-11 18:44:14
604 2017-08-10 18:44:14 2017-08-11 18:47:14
604 2017-08-10 19:44:14 2017-08-11 19:47:14
604 2017-08-10 20:30:14 2017-08-11 20:42:20
604 2017-08-10 21:44:14 2017-08-11 21:49:14
604 2017-08-10 22:44:14 2017-08-11 22:48:14
......
From these two I want to identify the accumulation of the DT2 rows lying inbetween the range of DT!.For example row 1 is starting from 2017-08-10 18:44:14 to 2017-08-11 19:33:17
So I want to take the accumulation of time in DT between that period and attach as column fro eah row in DT1
for example the first row would be like
id start_time end_time durationFromDT2
604 2017-08-10 18:44:14 2017-08-11 19:33:17 420
604 2017-08-10 20:38:20 2017-08-11 20:44:44 240
The 420 is because of (2017-08-11 18:44:14-2017-08-10 18:40:14)+(2017-08-11 18:47:14-2017-08-10 18:44:14)
.If there are no occurences in DT2 in the time period then I would like it to be 0.Like wise I have to group by many id's as well.
The 240 because of the time is (2017-08-11 20:42:20-2017-08-10 20:38:20)
So it's basically the coverage of DT2 for DT1 time period I tried looping through the rows but did not go well.Instead I am looking for any dplyr or data table solutions.Because looping was not working well
Any help is appreciated.
Upvotes: 1
Views: 85
Reputation: 14958
Assuming the rectifications suggested in the comments are correct, please find a dplyr
solution below:
merge(DT1, DT2, by = "id", all = TRUE) %>%
filter(t2 >= start_time, t1 <= end_time) %>%
mutate(t1_adj = if_else(start_time > t1, start_time, t1),
t2_adj = if_else(end_time < t2, end_time, t2),
difftime = difftime(t2_adj, t1_adj, units = "secs")) %>%
group_by(id, start_time, end_time) %>%
summarize(durationFromDT2 = sum(difftime)) %>%
right_join(DT1) %>%
mutate(durationFromDT2 = coalesce(durationFromDT2, 0))
cross join
all records between each of the tables
filter
down the cross join
to contain just those DT2
times that have a timepoint within the DT1
interval.
mutate
_adj
usted columns to force the t1
and t2
times to remain constrained between DT1
start and end timepoints, and then calculate the time intervals.
group
and summarize
the durations by DT1
records.
To list all of the NULL
records in the final result, right_join
the DT1
table back on again.
Replace
the NA
s with 0's.
The result looks like this:
# A tibble: 5 x 4 # Groups: id, start_time [5] id start_time end_time durationFromDT2 <int> <dttm> <dttm> <time> 1 604 2017-08-10 18:44:14 2017-08-10 19:33:17 180 secs 2 604 2017-08-10 20:38:20 2017-08-10 20:44:44 240 secs 3 604 2017-08-10 20:54:26 2017-08-10 20:58:48 0 secs 4 604 2017-08-10 21:35:50 2017-08-10 22:03:14 300 secs 5 604 2017-08-10 22:05:42 2017-08-10 22:17:12 0 secs
Reproducible (rectified) sample dataframes from the OP are below:
library(lubridate)
DT1 <-
read.table(text = "
id start_date start_time end_date end_time
604 2017-08-10 18:44:14 2017-08-10 19:33:17
604 2017-08-10 20:38:20 2017-08-10 20:44:44
604 2017-08-10 20:54:26 2017-08-10 20:58:48
604 2017-08-10 21:35:50 2017-08-10 22:03:14
604 2017-08-10 22:05:42 2017-08-10 22:17:12
", header = TRUE, stringsAsFactors = FALSE) %>%
mutate(start_time = ymd_hms(paste(start_date, start_time)),
end_time = ymd_hms(paste(end_date, end_time))) %>%
select(-c(start_date, end_date))
DT2 <-
read.table(text = "
id d1 t1 d2 t2
604 2017-08-10 18:40:14 2017-08-10 18:44:14
604 2017-08-10 18:44:14 2017-08-10 18:47:14
604 2017-08-10 19:44:14 2017-08-10 19:47:14
604 2017-08-10 20:30:14 2017-08-10 20:42:20
604 2017-08-10 21:44:14 2017-08-10 21:49:14
604 2017-08-10 22:44:14 2017-08-10 22:48:14
", header = TRUE, stringsAsFactors = FALSE) %>%
mutate(t1 = ymd_hms(paste(d1,t1)),
t2 = ymd_hms(paste(d2,t2)),
) %>%
select(-c(d1, d2))
Upvotes: 2