Reputation: 495
I am trying to get the time difference between elements of an array a sample of the data is below and the image at the bottom describes the problem I am trying to solve. I have a dataframe column events
where each value is an array of date and time entries that correspond to events and other columns which partition time into a before, evaluation and after period. I would like to calculate the statistics on the time between events.
** Update **
Using the excellent answer by danlooo
below which gives me almost exactly what I need if I
case_when
statement is tweakedthe following code appears to work:
duration <-
data %>% mutate(across(before_event_eval:after_eval_end,as.character)) %>%
as_tibble() %>%
mutate(
events = events %>% str_remove_all("[\\[\\]\\\"]")
) %>%
mutate( events = ifelse(events == "",events,paste0(events,",",
before_event_eval,",",as.character(as.Date(eval_month)-days(1)),
",",as.character(ceiling_date(as.Date('2021-02-01'),"month")),
",",after_eval_end))) %>%
separate_rows(events, sep = ",") %>%
rename(event = events) %>%
filter(event != "") %>%
mutate(across(before_event_eval:after_eval_end,parse_datetime)) %>%
mutate(
event = event %>% parse_datetime(),
position = case_when(
event >= before_event_eval &
event < eval_month ~ "before",
event <= after_eval_end &
event > eval_month ~ "after"
)
) %>%
arrange(id,event) %>% group_by(id) %>%
mutate(duration = as.numeric(event - lag(event))) %>%
group_by(id,position) %>%
summarise(time_until_first = first(duration[!is.na(duration)]),
timebetween_last = last(duration[!is.na(duration)]),
min_duration = min(duration,na.rm=TRUE),
avg_duration = mean(duration,na.rm=TRUE),
max_duration = max(duration,na.rm=TRUE))
I think a general strategy would be as follows but I am not sure how to proceed after step 1 and perform computations on the cleaned array:
Before: event is >= before_eval_begin and < eval_month
After: event is > eval_month and <= after_eval_end
Determine time between events for each period (before, after) including times relative to before_eval_begin, eval_month, after_eval_end
Return the below statistics:
If events is missing then all the values below should be set to 185
• Time to first event in pre period
• Time between last event in pre period and end of pre period
• Average time between events for pre period
• Minimum of time between events in pre period
• Maximum of time between events in pre period
• Time to first event in post period
• Time between last event in post period and end of post period
• Minimum of time between events in post period
• Maximum of time between events in post period
*Edit: removed duplicate events and added id column
Data
structure(list(id = c(1, 2, 3, 4), before_event_eval = structure(c(1596240000,
1596240000, 1604188800, 1604188800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), eval_month = structure(c(1612137600, 1612137600,
1619827200, 1619827200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
after_eval_end = structure(c(1627776000, 1627776000, 1635724800,
1635724800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
events = c("[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
"[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:26:32\"]",
"[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
Picture of Problem
Upvotes: 1
Views: 45
Reputation: 10627
Something like this?
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
data <- structure(list(
before_event_eval = structure(c(
1596240000, 1596240000,
1604188800, 1604188800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
eval_month = structure(c(
1612137600, 1612137600, 1619827200,
1619827200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
after_eval_end = structure(c(
1627776000, 1627776000, 1635724800,
1635724800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
events = c(
"[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
"[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:25:32\",\"2021-01-30 18:25:32\"]",
"[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
)
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(
NA,
-4L
))
events <-
data %>%
as_tibble() %>%
mutate(
id = row_number(),
events = events %>% str_remove_all("[\\[\\]\\\"]")
) %>%
separate_rows(events, sep = ",") %>%
rename(event = events) %>%
filter(event != "") %>%
mutate(
event = event %>% parse_datetime(),
position = case_when(
event >= before_event_eval &
year(event) == year(eval_month) &
month(event) < month(eval_month) ~ "before",
event <= after_eval_end &
year(event) == year(eval_month) &
month(event) > month(eval_month) ~ "after"
)
) %>%
arrange(event)
events
#> # A tibble: 11 × 6
#> before_event_eval eval_month after_eval_end
#> <dttm> <dttm> <dttm>
#> 1 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 2 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 3 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 4 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 5 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 6 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 7 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 8 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 9 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> 10 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 11 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> # … with 3 more variables: event <dttm>, id <int>, position <chr>
durations <-
events$event %>%
as.character() %>%
unique() %>%
combn(2) %>%
t() %>%
as_tibble() %>%
transmute(
from = parse_datetime(V1),
to = parse_datetime(V2),
duration = to - from
) %>%
left_join(events, by = c("from" = "event"))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
durations
#> # A tibble: 44 × 8
#> from to duration before_event_eval
#> <dttm> <dttm> <drtn> <dttm>
#> 1 2021-01-15 11:25:32 2021-01-27 11:25:32 288 hours 2020-08-01 00:00:00
#> 2 2021-01-15 11:25:32 2021-01-28 13:25:32 314 hours 2020-08-01 00:00:00
#> 3 2021-01-15 11:25:32 2021-01-28 18:25:32 319 hours 2020-08-01 00:00:00
#> 4 2021-01-15 11:25:32 2021-01-30 18:25:32 367 hours 2020-08-01 00:00:00
#> 5 2021-01-15 11:25:32 2021-03-30 18:25:32 1783 hours 2020-08-01 00:00:00
#> 6 2021-01-15 11:25:32 2021-04-28 13:25:32 2474 hours 2020-08-01 00:00:00
#> 7 2021-01-15 11:25:32 2021-04-30 18:25:32 2527 hours 2020-08-01 00:00:00
#> 8 2021-01-15 11:25:32 2021-05-28 10:25:32 3191 hours 2020-08-01 00:00:00
#> 9 2021-01-27 11:25:32 2021-01-28 13:25:32 26 hours 2020-08-01 00:00:00
#> 10 2021-01-27 11:25:32 2021-01-28 18:25:32 31 hours 2020-08-01 00:00:00
#> # … with 34 more rows, and 4 more variables: eval_month <dttm>,
#> # after_eval_end <dttm>, id <int>, position <chr>
durations %>%
group_by(position) %>%
summarise(
min_duration = min(duration),
avg_duration = mean(duration),
max_duration = max(duration)
)
#> # A tibble: 2 × 4
#> position min_duration avg_duration max_duration
#> <chr> <drtn> <drtn> <drtn>
#> 1 after 664 hours 876.750 hours 1408 hours
#> 2 before 5 hours 1600.925 hours 3191 hours
Created on 2022-04-26 by the reprex package (v2.0.0)
To only look at consecutive events, one can do
durations <-
events %>%
arrange(position, event) %>%
mutate(
from = event,
to = lead(event)
)
Upvotes: 2