How to de-aggregate time interval data in R?

Question

I have data in the form of start and stop times (in the format minutes:seconds). A simplistic example might be the timestamp of a light turning on, and the subsequent timestamp of the light turning off.

For example:

Start    Stop   
00:03.1  00:40.9
00:55.0  01:38.2
01:40.0  02:01.1

I would like to rearrange the data so that I can eventually look at it in terms of whole-minute interval bins in R.

Option 1: Turn the data into a binary listing for each tenth of a second, then aggregate the data later by timestamp.

Time.in.sec   Yes.or.No
0.0           N
0.1           N
...           ...
3.0           N
3.1           Y
3.2           Y
...           ...
40.8          Y
40.9          N
...           ...

Option 2: Split the time intervals at the minute marks and aggregate total time per minute (starting at time = 0:00.0) with some sort of logical rule.

Start        Stop
00:03.10     00:40.90
00:55.00     00:59.99
01:00.00     01:38.20
01:40.00     01:59.99
02:00.00     02:01.10

I have tried looking into lubridate functions (i.e., making each range into an interval class) and cut(), but I can’t seem to figure out how to make either of these ideas work. I also am unclear whether packages such as zoo would be appropriate for this; honestly, I have very little experience with date/time formats and time series.

Other questions on Stackoverflow seem to be addressing making bins from raw timestamps (e.g., What is an efficient method for partitioning and aggregating intervals from timestamped rows in a data frame? and Aggregate data by equally spaced time intervals in R), but I essentially want to do the opposite.

EDIT 1: Here is a CSV-format of the example data, up through minute 6.

Start, Stop 
00:03.1, 00:40.9
00:55.0, 01:38.2
01:40.0, 02:01.1
03:03.1, 04:30.3
04:50.0, 05:01.5
05:08.7, 05:22.0
05:40.1, 05:47.9

EDIT 2: My ultimate goal for this is to have the data in a format that I can use to chunk the observations into standardized time bins (Minute 1, Minute 2, etc.) to get a by-minute percentage of when the data is "Yes". Basically I want to get a summary of the distribution of states by minute, and since the data is binary, I can do this by looking at the "yes" state.

For the first 3 minutes (from 00:00.0 up until 03:00.0), the output would be something like this:

Minute  time.yes.sec  perc.time.yes
1       42.8          71.33
2       58.2          96.98
3       1.1           1.83

# *NOTE: Here, Minute 1 = [0, 60), Minute 2 = [60, 120), etc.; I'm not opposed 
# to the reverse definitions though (Minute 1 = (0, 60], etc.).

I could alternatively look at the data as a cumulative distribution plot, with each successive time point updating the value of "total time yes". However, If I could get the data in the format of option 1, I would have the flexibility to look at the data either way.

alistaire · Accepted Answer

An option, lightly edited from my version in the comments:

library(tidyverse)
library(lubridate)

df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>%    # convert each time to seconds
    rowwise() %>%    # evaluate the following row-by-row
    # make a sequence from Start to Stop by 0.1, wrapped in a list
    mutate(instant = list(seq(Start, Stop, by = 0.1))) %>% 
    unnest() %>%    # expand list column
    # make a factor, cutting instants into 60 second bins
    mutate(minute = cut(instant, breaks = (0:6) * 60, labels = 1:6)) %>% 
    group_by(minute) %>%    # evaluate the following grouped by new factor column
    # for each group, count the rows, subtracting 1 for starting instants, and
    # dividing by 10 to convert from tenths of seconds to secontds
    summarise(elapsed = (n() - n_distinct(Start)) / 10,
              pct_elapsed = elapsed / 60 * 100)    # convert to percent

## # A tibble: 6 × 3
##   minute elapsed pct_elapsed
##             
## 1      1    42.8   71.333333
## 2      2    58.1   96.833333
## 3      3     1.0    1.666667
## 4      4    56.9   94.833333
## 5      5    40.2   67.000000
## 6      6    22.5   37.500000

Note the correction for counting starting instants is imperfect, as it will subtract for every starting instant, even if it is a continuation of a sequence from the previous minute. It could be calculated more thoroughly if precision matters.

A more precise but somewhat difficult route is to add stops and starts at the turn of each minute:

df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>%    # convert to seconds
    gather(var, val) %>%    # gather to long form
    # construct and rbind data.frame of breaks at minute changes
    bind_rows(expand.grid(var = c('Start', 'Stop'), 
                          val = seq(60, by = 60, length.out = floor(max(.$val)/60)))) %>% 
    arrange(val, desc(var)) %>%    # sort
    mutate(index = rep(seq(n()/2), each = 2)) %>%    # make indices for spreading
    spread(var, val) %>%    # spread back to wide form
    mutate(elapsed = Stop - Start) %>%    # calculate elapsed time for each row
    # add and group by factor of which minute each falls in
    group_by(minute = cut(Stop, seq(0, by = 60, length.out = ceiling(max(Stop) / 60 + 1)), 
                        labels = 1:6)) %>% 
    summarise(elapsed = sum(elapsed),    # calculate summaries
              pct_elapsed = elapsed / 60 * 100)

## # A tibble: 6 × 3
##   minute elapsed pct_elapsed
##             
## 1      1    42.8   71.333333
## 2      2    58.2   97.000000
## 3      3     1.1    1.833333
## 4      4    56.9   94.833333
## 5      5    40.3   67.166667
## 6      6    22.6   37.666667

How to de-aggregate time interval data in R?

Answers (2)

Related Questions