te time
te time

Reputation: 495

Perform calculation on column of data frame that has array of datetimes as values

I am trying to get the time difference between elements of an array a sample of the data is below and the image at the bottom describes the problem I am trying to solve. I have a dataframe column events where each value is an array of date and time entries that correspond to events and other columns which partition time into a before, evaluation and after period. I would like to calculate the statistics on the time between events.

** Update **

Using the excellent answer by danlooo below which gives me almost exactly what I need if I

the following code appears to work:

   duration <-
  data %>% mutate(across(before_event_eval:after_eval_end,as.character)) %>% 
  as_tibble() %>%
  mutate(
    
    events = events %>% str_remove_all("[\\[\\]\\\"]")
  ) %>%
  mutate( events = ifelse(events == "",events,paste0(events,",",
                   before_event_eval,",",as.character(as.Date(eval_month)-days(1)),
                   ",",as.character(ceiling_date(as.Date('2021-02-01'),"month")),
                   ",",after_eval_end))) %>% 
  separate_rows(events, sep = ",") %>%
  rename(event = events) %>%
  filter(event != "") %>%
  mutate(across(before_event_eval:after_eval_end,parse_datetime)) %>% 
  mutate(
    event = event %>% parse_datetime(),  
    position = case_when(
      event >= before_event_eval &
        event < eval_month  ~ "before",
      event <= after_eval_end &
        event > eval_month  ~ "after"
    )
  ) %>% 
  arrange(id,event) %>%  group_by(id) %>% 
  mutate(duration = as.numeric(event - lag(event))) %>% 
  group_by(id,position) %>%
  summarise(time_until_first = first(duration[!is.na(duration)]),
            timebetween_last = last(duration[!is.na(duration)]),
            min_duration = min(duration,na.rm=TRUE),
            avg_duration = mean(duration,na.rm=TRUE),
            max_duration = max(duration,na.rm=TRUE))

I think a general strategy would be as follows but I am not sure how to proceed after step 1 and perform computations on the cleaned array:

  1. remove brackets and parenthesis from string
  2. create ordered vector of events
  3. Determine if event falls before or after eval month:

Before: event is >= before_eval_begin and < eval_month

After: event is > eval_month and <= after_eval_end

  1. Determine time between events for each period (before, after) including times relative to before_eval_begin, eval_month, after_eval_end

  2. Return the below statistics:

If events is missing then all the values below should be set to 185

•   Time to first event in pre period

•   Time between last event in pre period and end of pre period

•   Average time between events for pre period 

•   Minimum of time between events in pre period

•   Maximum of time between events in pre period

•   Time to first event in post period

•   Time between last event in post period and end of post period

•   Minimum of time between events in post period

•   Maximum of time between events in post period

*Edit: removed duplicate events and added id column

Data

   structure(list(id = c(1, 2, 3, 4), before_event_eval = structure(c(1596240000, 
                                                                   1596240000, 1604188800, 1604188800), class = c("POSIXct", "POSIXt"
                                                                   ), tzone = "UTC"), eval_month = structure(c(1612137600, 1612137600, 
                                                                                                               1619827200, 1619827200), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
               after_eval_end = structure(c(1627776000, 1627776000, 1635724800, 
                                            1635724800), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
               events = c("[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]", 
                          "[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:26:32\"]", 
                          "[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
               )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
                                                                           -4L))

Picture of Problem

enter image description here

Upvotes: 1

Views: 45

Answers (1)

danlooo
danlooo

Reputation: 10627

Something like this?

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

data <- structure(list(
  before_event_eval = structure(c(
    1596240000, 1596240000,
    1604188800, 1604188800
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  eval_month = structure(c(
    1612137600, 1612137600, 1619827200,
    1619827200
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  after_eval_end = structure(c(
    1627776000, 1627776000, 1635724800,
    1635724800
  ), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  events = c(
    "[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
    "[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:25:32\",\"2021-01-30 18:25:32\"]",
    "[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
  )
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(
  NA,
  -4L
))

events <-
  data %>%
  as_tibble() %>%
  mutate(
    id = row_number(),
    events = events %>% str_remove_all("[\\[\\]\\\"]")
  ) %>%
  separate_rows(events, sep = ",") %>%
  rename(event = events) %>%
  filter(event != "") %>%
  mutate(
    event = event %>% parse_datetime(),
    position = case_when(
      event >= before_event_eval &
        year(event) == year(eval_month) &
        month(event) < month(eval_month) ~ "before",
      event <= after_eval_end &
        year(event) == year(eval_month) &
        month(event) > month(eval_month) ~ "after"
    )
  ) %>%
  arrange(event)
events
#> # A tibble: 11 × 6
#>    before_event_eval   eval_month          after_eval_end     
#>    <dttm>              <dttm>              <dttm>             
#>  1 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  2 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  3 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  4 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  5 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  6 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  7 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  8 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#>  9 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> 10 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 11 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> # … with 3 more variables: event <dttm>, id <int>, position <chr>

durations <-
  events$event %>%
  as.character() %>%
  unique() %>%
  combn(2) %>%
  t() %>%
  as_tibble() %>%
  transmute(
    from = parse_datetime(V1),
    to = parse_datetime(V2),
    duration = to - from
  ) %>%
  left_join(events, by = c("from" = "event"))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
durations
#> # A tibble: 44 × 8
#>    from                to                  duration   before_event_eval  
#>    <dttm>              <dttm>              <drtn>     <dttm>             
#>  1 2021-01-15 11:25:32 2021-01-27 11:25:32  288 hours 2020-08-01 00:00:00
#>  2 2021-01-15 11:25:32 2021-01-28 13:25:32  314 hours 2020-08-01 00:00:00
#>  3 2021-01-15 11:25:32 2021-01-28 18:25:32  319 hours 2020-08-01 00:00:00
#>  4 2021-01-15 11:25:32 2021-01-30 18:25:32  367 hours 2020-08-01 00:00:00
#>  5 2021-01-15 11:25:32 2021-03-30 18:25:32 1783 hours 2020-08-01 00:00:00
#>  6 2021-01-15 11:25:32 2021-04-28 13:25:32 2474 hours 2020-08-01 00:00:00
#>  7 2021-01-15 11:25:32 2021-04-30 18:25:32 2527 hours 2020-08-01 00:00:00
#>  8 2021-01-15 11:25:32 2021-05-28 10:25:32 3191 hours 2020-08-01 00:00:00
#>  9 2021-01-27 11:25:32 2021-01-28 13:25:32   26 hours 2020-08-01 00:00:00
#> 10 2021-01-27 11:25:32 2021-01-28 18:25:32   31 hours 2020-08-01 00:00:00
#> # … with 34 more rows, and 4 more variables: eval_month <dttm>,
#> #   after_eval_end <dttm>, id <int>, position <chr>

durations %>%
  group_by(position) %>%
  summarise(
    min_duration = min(duration),
    avg_duration = mean(duration),
    max_duration = max(duration)
  )
#> # A tibble: 2 × 4
#>   position min_duration avg_duration   max_duration
#>   <chr>    <drtn>       <drtn>         <drtn>      
#> 1 after    664 hours     876.750 hours 1408 hours  
#> 2 before     5 hours    1600.925 hours 3191 hours

Created on 2022-04-26 by the reprex package (v2.0.0)

To only look at consecutive events, one can do

durations <-
  events %>%
  arrange(position, event) %>%
  mutate(
    from = event,
    to = lead(event)
  )

Upvotes: 2

Related Questions