Reputation: 17
In the data frame below, I have sequences of events that can be repeated until a new event is generated.
I'd like to use a function that helps me after selecting these similar rows, calculates the difference between the timestamp_end
of the last seen event and the timestamp_start
of the first event.
The dataframe :
DF1 <- data.frame(segment_id = c(1, 1, 1, 1, 2 , 3, 4), first_event= c("a", "a", "a","a", "a", "b","c" ), second_event = c("a", "a","a","a", "b", "c", "c"), timestamp_start = c("2019-06-06 11:47:00","2019-06-06 12:59:38", "2019-06-06 13:01:03", "2019-06-06 14:47:03 ", "2019-06-06 18:47:00", "2019-06-06 22:47:00", "2019-06-07 02:47:00") , timestamp_end = c("2019-06-06 12:59:38", "2019-06-06 13:01:03", "2019-06-06 14:47:03", "2019-06-06 18:47:00", "2019-06-06 22:47:00 ", "2019-06-07 02:47:00", "2019-06-07 06:47:00"))
segment_id first_event second_event timestamp_start timestamp_end
1 a a 2019-06-06 11:47:00 2019-06-06 12:59:38
1 a a 2019-06-06 12:59:38 2019-06-06 13:01:03
1 a a 2019-06-06 13:01:03 2019-06-06 14:47:03
1 a a 2019-06-06 14:47:03 2019-06-06 18:47:00
2 a b 2019-06-06 18:47:00 2019-06-06 22:47:00
3 b c 2019-06-06 22:47:00 2019-06-07 02:47:00
4 c c 2019-06-07 02:47:00 2019-06-07 06:47:00
So, I have tried the dplyr
package and group_by()
and mutate()
functions. However, I am not sure which function can help me to get the duration.
DF2 <- DF1 %>%
group_by(segment_id)%>%
mutate("duration" = difftime(????) , units = 'hours')
The final result I am looking for as DF2
should looks like this:
>DF2
segment_id first_event second_event timestamp_start timestamp_end duration
1 a a 2019-06-06 11:47:00 2019-06-06 18:47:00 7
2 a b 2019-06-06 18:47:00 2019-06-06 22:47:00 4
3 b c 2019-06-06 22:47:00 2019-06-07 02:47:00 4
4 c c 2019-06-07 02:47:00 2019-06-07 06:47:00 4
I'd appreciate your help on this one.
Upvotes: 0
Views: 280
Reputation: 25225
An option using data.table
:
setDT(DF1)[, .(timestamp_start=min(timestamp_start), timestamp_end=max(timestamp_end)), .(segment_id, first_event, second_event)][,
duration := difftime(timestamp_end, timestamp_start, units="hours")][]
data:
library(data.table)
cols <- c("timestamp_start", "timestamp_end")
setDT(DF1)[, (cols) := lapply(.SD, as.POSIXct, format="%Y-%m-%d %T"), .SDcols=cols]
Upvotes: 2
Reputation: 8117
library(lubridate)
DF1 %>%
mutate_at(vars(timestamp_start, timestamp_end)
, function(x) ymd_hms(as.character(x))) %>%
group_by(segment_id) %>%
summarise(first_event = first(first_event)
, second_event = last(second_event)
, timestamp_start = first(timestamp_start)
, timestamp_end = last(timestamp_end)
, duration = max(timestamp_end) - min(timestamp_start))
segment_id first_event second_event timestamp_start timestamp_end duration
<dbl> <fct> <fct> <dttm> <dttm> <drtn>
1 1 a a 2019-06-06 11:47:00 2019-06-06 18:47:00 7 hours
2 2 a b 2019-06-06 18:47:00 2019-06-06 22:47:00 4 hours
3 3 b c 2019-06-06 22:47:00 2019-06-07 02:47:00 4 hours
4 4 c c 2019-06-07 02:47:00 2019-06-07 06:47:00 4 hours
Upvotes: 2
Reputation: 388817
We could change the column timestamp_start
and timestamp_end
to POSIXct
type, group_by
segment_id
, first_event
and second_event
, get first
timestamp_start
and last
timestamp_end
and calculate the difference in hours between them.
library(dplyr)
DF1 %>%
mutate_at(vars(starts_with('timestamp')), as.POSIXct) %>%
group_by(segment_id, first_event, second_event) %>%
summarise(timestamp_start = first(timestamp_start),
timestamp_end = last(timestamp_end),
duration = as.numeric(difftime(timestamp_end,
timestamp_start, units = "hours")))
# segment_id first_event second_event timestamp_start timestamp_end duration
# <dbl> <fct> <fct> <dttm> <dttm> <dbl>
#1 1 a a 2019-06-06 11:47:00 2019-06-06 18:47:00 7
#2 2 a b 2019-06-06 18:47:00 2019-06-06 22:47:00 4
#3 3 b c 2019-06-06 22:47:00 2019-06-07 02:47:00 4
#4 4 c c 2019-06-07 02:47:00 2019-06-07 06:47:00 4
Upvotes: 1