Perform calculation on column of data frame that has array of datetimes as values

Question

I am trying to get the time difference between elements of an array a sample of the data is below and the image at the bottom describes the problem I am trying to solve. I have a dataframe column events where each value is an array of date and time entries that correspond to events and other columns which partition time into a before, evaluation and after period. I would like to calculate the statistics on the time between events.

** Update **

Using the excellent answer by danlooo below which gives me almost exactly what I need if I

add the four boundary events corresponding to before_eval_begin, eval_month, after_eval_end to the event array
duration is calculated for consecutive events
the before and after case_when statement is tweaked

the following code appears to work:

   duration <-
  data %>% mutate(across(before_event_eval:after_eval_end,as.character)) %>% 
  as_tibble() %>%
  mutate(
    
    events = events %>% str_remove_all("[\\"]")
  ) %>%
  mutate( events = ifelse(events == "",events,paste0(events,",",
                   before_event_eval,",",as.character(as.Date(eval_month)-days(1)),
                   ",",as.character(ceiling_date(as.Date('2021-02-01'),"month")),
                   ",",after_eval_end))) %>% 
  separate_rows(events, sep = ",") %>%
  rename(event = events) %>%
  filter(event != "") %>%
  mutate(across(before_event_eval:after_eval_end,parse_datetime)) %>% 
  mutate(
    event = event %>% parse_datetime(),  
    position = case_when(
      event >= before_event_eval &
        event < eval_month  ~ "before",
      event <= after_eval_end &
        event > eval_month  ~ "after"
    )
  ) %>% 
  arrange(id,event) %>%  group_by(id) %>% 
  mutate(duration = as.numeric(event - lag(event))) %>% 
  group_by(id,position) %>%
  summarise(time_until_first = first(duration[!is.na(duration)]),
            timebetween_last = last(duration[!is.na(duration)]),
            min_duration = min(duration,na.rm=TRUE),
            avg_duration = mean(duration,na.rm=TRUE),
            max_duration = max(duration,na.rm=TRUE))

I think a general strategy would be as follows but I am not sure how to proceed after step 1 and perform computations on the cleaned array:

remove brackets and parenthesis from string
create ordered vector of events
Determine if event falls before or after eval month:

Before: event is >= before_eval_begin and < eval_month

After: event is > eval_month and <= after_eval_end

Determine time between events for each period (before, after) including times relative to before_eval_begin, eval_month, after_eval_end
Return the below statistics:

If events is missing then all the values below should be set to 185

•   Time to first event in pre period

•   Time between last event in pre period and end of pre period

•   Average time between events for pre period 

•   Minimum of time between events in pre period

•   Maximum of time between events in pre period

•   Time to first event in post period

•   Time between last event in post period and end of post period

•   Minimum of time between events in post period

•   Maximum of time between events in post period

*Edit: removed duplicate events and added id column

Data

   structure(list(id = c(1, 2, 3, 4), before_event_eval = structure(c(1596240000, 
                                                                   1596240000, 1604188800, 1604188800), class = c("POSIXct", "POSIXt"
                                                                   ), tzone = "UTC"), eval_month = structure(c(1612137600, 1612137600, 
                                                                                                               1619827200, 1619827200), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
               after_eval_end = structure(c(1627776000, 1627776000, 1635724800, 
                                            1635724800), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
               events = c("[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]", 
                          "[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:26:32\"]", 
                          "[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
               )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
                                                                           -4L))

Picture of Problem

danlooo · Accepted Answer

Something like this?

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

data <- structure(list(
  before_event_eval = structure(c(
    1596240000, 1596240000,
    1604188800, 1604188800
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  eval_month = structure(c(
    1612137600, 1612137600, 1619827200,
    1619827200
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  after_eval_end = structure(c(
    1627776000, 1627776000, 1635724800,
    1635724800
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  events = c(
    "[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
    "[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:25:32\",\"2021-01-30 18:25:32\"]",
    "[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
  )
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(
  NA,
  -4L
))

events <-
  data %>%
  as_tibble() %>%
  mutate(
    id = row_number(),
    events = events %>% str_remove_all("[\\"]")
  ) %>%
  separate_rows(events, sep = ",") %>%
  rename(event = events) %>%
  filter(event != "") %>%
  mutate(
    event = event %>% parse_datetime(),
    position = case_when(
      event >= before_event_eval &
        year(event) == year(eval_month) &
        month(event) < month(eval_month) ~ "before",
      event <= after_eval_end &
        year(event) == year(eval_month) &
        month(event) > month(eval_month) ~ "after"
    )
  ) %>%
  arrange(event)
events
#> # A tibble: 11 × 6
#>    before_event_eval   eval_month          after_eval_end     
#>                                             
#>  1 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  2 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  3 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  4 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  5 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  6 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  7 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  8 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  9 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> 10 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 11 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> # … with 3 more variables: event , id , position 

durations <-
  events$event %>%
  as.character() %>%
  unique() %>%
  combn(2) %>%
  t() %>%
  as_tibble() %>%
  transmute(
    from = parse_datetime(V1),
    to = parse_datetime(V2),
    duration = to - from
  ) %>%
  left_join(events, by = c("from" = "event"))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
durations
#> # A tibble: 44 × 8
#>    from                to                  duration   before_event_eval  
#>                                                  
#>  1 2021-01-15 11:25:32 2021-01-27 11:25:32  288 hours 2020-08-01 00:00:00
#>  2 2021-01-15 11:25:32 2021-01-28 13:25:32  314 hours 2020-08-01 00:00:00
#>  3 2021-01-15 11:25:32 2021-01-28 18:25:32  319 hours 2020-08-01 00:00:00
#>  4 2021-01-15 11:25:32 2021-01-30 18:25:32  367 hours 2020-08-01 00:00:00
#>  5 2021-01-15 11:25:32 2021-03-30 18:25:32 1783 hours 2020-08-01 00:00:00
#>  6 2021-01-15 11:25:32 2021-04-28 13:25:32 2474 hours 2020-08-01 00:00:00
#>  7 2021-01-15 11:25:32 2021-04-30 18:25:32 2527 hours 2020-08-01 00:00:00
#>  8 2021-01-15 11:25:32 2021-05-28 10:25:32 3191 hours 2020-08-01 00:00:00
#>  9 2021-01-27 11:25:32 2021-01-28 13:25:32   26 hours 2020-08-01 00:00:00
#> 10 2021-01-27 11:25:32 2021-01-28 18:25:32   31 hours 2020-08-01 00:00:00
#> # … with 34 more rows, and 4 more variables: eval_month ,
#> #   after_eval_end , id , position 

durations %>%
  group_by(position) %>%
  summarise(
    min_duration = min(duration),
    avg_duration = mean(duration),
    max_duration = max(duration)
  )
#> # A tibble: 2 × 4
#>   position min_duration avg_duration   max_duration
#>                             
#> 1 after    664 hours     876.750 hours 1408 hours  
#> 2 before     5 hours    1600.925 hours 3191 hours

^{Created on 2022-04-26 by the reprex package (v2.0.0)}

To only look at consecutive events, one can do

durations <-
  events %>%
  arrange(position, event) %>%
  mutate(
    from = event,
    to = lead(event)
  )

Perform calculation on column of data frame that has array of datetimes as values

Answers (1)

Related Questions