Reputation: 135
I have data in the form of start and stop times (in the format minutes:seconds). A simplistic example might be the timestamp of a light turning on, and the subsequent timestamp of the light turning off.
For example:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
I would like to rearrange the data so that I can eventually look at it in terms of whole-minute interval bins in R.
Option 1: Turn the data into a binary listing for each tenth of a second, then aggregate the data later by timestamp.
Time.in.sec Yes.or.No
0.0 N
0.1 N
... ...
3.0 N
3.1 Y
3.2 Y
... ...
40.8 Y
40.9 N
... ...
Option 2: Split the time intervals at the minute marks and aggregate total time per minute (starting at time = 0:00.0) with some sort of logical rule.
Start Stop
00:03.10 00:40.90
00:55.00 00:59.99
01:00.00 01:38.20
01:40.00 01:59.99
02:00.00 02:01.10
I have tried looking into lubridate functions (i.e., making each range into an interval class) and cut()
, but I can’t seem to figure out how to make either of these ideas work. I also am unclear whether packages such as zoo would be appropriate for this; honestly, I have very little experience with date/time formats and time series.
Other questions on Stackoverflow seem to be addressing making bins from raw timestamps (e.g., What is an efficient method for partitioning and aggregating intervals from timestamped rows in a data frame? and Aggregate data by equally spaced time intervals in R), but I essentially want to do the opposite.
EDIT 1: Here is a CSV-format of the example data, up through minute 6.
Start, Stop
00:03.1, 00:40.9
00:55.0, 01:38.2
01:40.0, 02:01.1
03:03.1, 04:30.3
04:50.0, 05:01.5
05:08.7, 05:22.0
05:40.1, 05:47.9
EDIT 2: My ultimate goal for this is to have the data in a format that I can use to chunk the observations into standardized time bins (Minute 1, Minute 2, etc.) to get a by-minute percentage of when the data is "Yes". Basically I want to get a summary of the distribution of states by minute, and since the data is binary, I can do this by looking at the "yes" state.
For the first 3 minutes (from 00:00.0 up until 03:00.0), the output would be something like this:
Minute time.yes.sec perc.time.yes
1 42.8 71.33
2 58.2 96.98
3 1.1 1.83
# *NOTE: Here, Minute 1 = [0, 60), Minute 2 = [60, 120), etc.; I'm not opposed
# to the reverse definitions though (Minute 1 = (0, 60], etc.).
I could alternatively look at the data as a cumulative distribution plot, with each successive time point updating the value of "total time yes". However, If I could get the data in the format of option 1, I would have the flexibility to look at the data either way.
Upvotes: 3
Views: 1352
Reputation: 43354
An option, lightly edited from my version in the comments:
library(tidyverse)
library(lubridate)
df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert each time to seconds
rowwise() %>% # evaluate the following row-by-row
# make a sequence from Start to Stop by 0.1, wrapped in a list
mutate(instant = list(seq(Start, Stop, by = 0.1))) %>%
unnest() %>% # expand list column
# make a factor, cutting instants into 60 second bins
mutate(minute = cut(instant, breaks = (0:6) * 60, labels = 1:6)) %>%
group_by(minute) %>% # evaluate the following grouped by new factor column
# for each group, count the rows, subtracting 1 for starting instants, and
# dividing by 10 to convert from tenths of seconds to secontds
summarise(elapsed = (n() - n_distinct(Start)) / 10,
pct_elapsed = elapsed / 60 * 100) # convert to percent
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.1 96.833333
## 3 3 1.0 1.666667
## 4 4 56.9 94.833333
## 5 5 40.2 67.000000
## 6 6 22.5 37.500000
Note the correction for counting starting instants is imperfect, as it will subtract for every starting instant, even if it is a continuation of a sequence from the previous minute. It could be calculated more thoroughly if precision matters.
A more precise but somewhat difficult route is to add stops and starts at the turn of each minute:
df %>% mutate_all(funs(period_to_seconds(ms(.)))) %>% # convert to seconds
gather(var, val) %>% # gather to long form
# construct and rbind data.frame of breaks at minute changes
bind_rows(expand.grid(var = c('Start', 'Stop'),
val = seq(60, by = 60, length.out = floor(max(.$val)/60)))) %>%
arrange(val, desc(var)) %>% # sort
mutate(index = rep(seq(n()/2), each = 2)) %>% # make indices for spreading
spread(var, val) %>% # spread back to wide form
mutate(elapsed = Stop - Start) %>% # calculate elapsed time for each row
# add and group by factor of which minute each falls in
group_by(minute = cut(Stop, seq(0, by = 60, length.out = ceiling(max(Stop) / 60 + 1)),
labels = 1:6)) %>%
summarise(elapsed = sum(elapsed), # calculate summaries
pct_elapsed = elapsed / 60 * 100)
## # A tibble: 6 × 3
## minute elapsed pct_elapsed
## <fctr> <dbl> <dbl>
## 1 1 42.8 71.333333
## 2 2 58.2 97.000000
## 3 3 1.1 1.833333
## 4 4 56.9 94.833333
## 5 5 40.3 67.166667
## 6 6 22.6 37.666667
Upvotes: 4
Reputation: 23231
I did the following using your original data prior to the edit:
Start Stop
00:03.1 00:40.9
00:55.0 01:38.2
01:40.0 02:01.1
agg <- read.table(con<-file("clipboard"), header=T)
The ms
function below takes the raw character input I read in from the clipboard and turns changes it into minutes and seconds with an appropriate class, so that it can be used for comparisons. The same is true for the seconds
function, the only difference there being that I'm dealing with data that's just measured in seconds, not minutes and seconds.
agg$Start <- lubridate::ms(agg$Start)
agg$Stop <- lubridate::ms(agg$Stop)
option1 <- data.frame(time = lubridate::seconds(seq(.1, 122, .1)),
flag = as.character("N"), stringsAsFactors = F)
for(i in 1:nrow(agg)){
option1$flag[option1$time > agg$Start[i] & option1$time < agg$Stop[i]] <- "Y"
}
To verify that it worked, let's look at table()
:
table(option1$flag)
N Y 201 1019
option1$minute <- ifelse(option1$time < lubridate::seconds(60), 0, 1)
option1$minute[option1$time > lubridate::seconds(120)] <- 2
table(option1$flag, option1$minute)
0 1 2 N 172 19 10 Y 427 582 10
prop.table(table(option1$flag, option1$minute),2)
0 1 2 N 0.28714524 0.03161398 0.50000000 Y 0.71285476 0.96838602 0.50000000
Upvotes: 3