Ryan
Ryan

Reputation: 1068

Summarizing observations within date and time boundaries in R

I have a dataframe called ActiveData that shows the date and times (StartDT and EndDT) when individual IDs were active. Active is structured like this where StartDTandEndDT` are formatted in mdy_hms:

ID         StartDT               EndDT
1  05/05/2021 8:15:00    05/05/2021 9:15:00
2  05/05/2021 8:15:00    05/05/2021 9:15:00
3  05/05/2021 8:15:00    05/05/2021 10:15:00
…

I have another dataframe called Observations that shows observations where each ID observed themselves or another ID satisfying some variable. Here, ID denotes the observer and IDobserved denotes which ID was observed satisfying the variable (IDs can also observe themselves).

ID       DT                          IDobserved
1      05/05/2021 8:19:00                1
1      05/05/2021 8:20:00                1      
1      05/05/2021 8:19:00                2
2      05/05/2021 8:19:20                1
2      05/05/2021 8:19:45                3
3      05/05/2021 8:19:00                1
3      05/05/2021 8:20:00                1
3      05/05/2021 8:25:00                1
3      05/05/2021 8:45:00                3
3      05/05/2021 8:19:00                2
…

I want to summarize the number of times that each ID observed the other IDs (including themselves) satisfying the variable within the time constraints specified by StartDT and EndDT in the ActiveData dataframe, so that the final table would specify the number of observations, and the amount of time in seconds that passed between the boundaries that each ID was actively observing (between StartDT and EndDT in ActiveData). So for the data above, the final table would look like this:

ID   IDobserved   Observations      TimeElapsed
1         1            2             3600
1         2            1             3600
2         1            1             3600
2         3            1             3600
3         1            3             7200
3         2            1             7200
3         3            1             7200

How can this be done?

Upvotes: 3

Views: 120

Answers (2)

ktiu
ktiu

Reputation: 2626

Cool question! With your data

ActiveData <- tibble::tribble(
  ~ID, ~StartDT,             ~EndDT,
  1,   "05/05/2021 8:15:00", "05/05/2021 9:15:00",
  2,   "05/05/2021 8:15:00", "05/05/2021 9:15:00",
  3,   "05/05/2021 8:15:00", "05/05/2021 10:15:00"
)

Observations <- tibble::tribble(
  ~ID, ~DT,                ~IDobserved,
  1,   "05/05/2021 8:19:00", 1,
  1,   "05/05/2021 8:20:00", 1,
  1,   "05/05/2021 8:19:00", 2,
  2,   "05/05/2021 8:19:20", 1,
  2,   "05/05/2021 8:19:45", 3,
  3,   "05/05/2021 8:19:00", 1,
  3,   "05/05/2021 8:20:00", 1,
  3,   "05/05/2021 8:25:00", 1,
  3,   "05/05/2021 8:45:00", 3,
  3,   "05/05/2021 8:19:00", 2
)

I would do

library(dplyr)

fmt <- "%d/%m/%Y %H:%M:%S"

ActiveData %>%
  mutate(across(-ID, ~ as.POSIXct(., format = fmt))) %>%
  purrr::pmap(\(...) {
    args <- list(...)
    Observations %>%
      mutate(DT = as.POSIXct(DT, format = fmt)) %>%
      filter(DT >= args$StartDT, DT <= args$EndDT, ID == args$ID) %>%
      count(ID, IDobserved, name = "Observations") %>%
      mutate(TimeElapsed = difftime(args$EndDT,
                                    args$StartDT,
                                    units =  "secs"))
  }) %>%
  bind_rows()

returning

# A tibble: 7 x 4
     ID IDobserved Observations TimeElapsed
  <dbl>      <dbl>        <int> <drtn>
1     1          1            2 3600 secs
2     1          2            1 3600 secs
3     2          1            1 3600 secs
4     2          3            1 3600 secs
5     3          1            3 7200 secs
6     3          2            1 7200 secs
7     3          3            1 7200 secs

Upvotes: 1

akrun
akrun

Reputation: 887148

Here is a method using data.table.

  1. Convert data.frame to data.table with setDT
  2. Convert the 'DT', 'StartDT', 'EndDT' columns in both dataset to datetime class (mdy_hms)
  3. Create the 'Observations' column in second dataset (df2) as the number of observations per group ID, IDobserved
  4. Do a join with the first data on the ID column, specify, summarise by getting the sum of difference between the DT, StartDT, EndDT columns using difftime and return the unique rows
library(data.table)
library(lubridate)
setDT(df1)[, c('StartDT', 'EndDT') := lapply(.SD, mdy_hms),
       .SDcols = 2:3]
setDT(df2)[, DT := mdy_hms(DT)]
df2[, Observations := .N, .(ID, IDobserved)]
unique(df2[df1, .(IDobserved, Observations,
      TimeElapsed = as.numeric(difftime(DT, StartDT, units = 'sec') +
        difftime(EndDT, DT, units = 'sec'))), on = .(ID), by = .EACHI])

-output

    ID IDobserved Observations TimeElapsed
1:  1          1            2        3600
2:  1          2            1        3600
3:  2          1            1        3600
4:  2          3            1        3600
5:  3          1            3        7200
6:  3          3            1        7200
7:  3          2            1        7200

data


df1 <- structure(list(ID = 1:3, StartDT = c("05/05/2021 8:15:00", 
"05/05/2021 8:15:00", 
"05/05/2021 8:15:00"), EndDT = c("05/05/2021 9:15:00", "05/05/2021 9:15:00", 
"05/05/2021 10:15:00")), class = "data.frame", row.names = c(NA, 
-3L))

df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), 
    DT = c("05/05/2021 8:19:00", "05/05/2021 8:20:00", "05/05/2021 8:19:00", 
    "05/05/2021 8:19:20", "05/05/2021 8:19:45", "05/05/2021 8:19:00", 
    "05/05/2021 8:20:00", "05/05/2021 8:25:00", "05/05/2021 8:45:00", 
    "05/05/2021 8:19:00"), IDobserved = c(1L, 1L, 2L, 1L, 3L, 
    1L, 1L, 1L, 3L, 2L)), class = "data.frame", row.names = c(NA, 
-10L))

Upvotes: 1

Related Questions