Running count records in time series according to a condition

Question

I have a toy dataset with timestamp and price like this one:

time <- c(as.POSIXlt("2017-02-03 09:00:01"),
        as.POSIXlt("2017-02-03 09:00:03"),
        as.POSIXlt("2017-02-03 09:00:06"),
        as.POSIXlt("2017-02-03 09:00:09"),
        as.POSIXlt("2017-02-03 09:00:10"),
       as.POSIXlt("2017-02-03 09:00:20"),
       as.POSIXlt("2017-02-03 09:00:23"),
       as.POSIXlt("2017-02-03 09:00:34"),
       as.POSIXlt("2017-02-03 09:00:44"),
as.POSIXlt("2017-02-03 09:01:07"))

price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)

data <- data.frame(time, price)

I need to loop through the data.frame, find consecutive sequences of records at the same price, count the number of records in these sequences and calculate the duration in seconds from the first to the last member of the same-price-sequence.

So, for the above example the result is:

start, end, price, nbr_records, duration_sec
2017-02-03 09:00:01, 2017-02-03 09:00:03, 100, 3, 5
2017-02-03 09:00:09, 2017-02-03 09:00:09, 99, 1, 0
2017-02-03 09:00:10, 2017-02-03 09:00:10, 98, 1, 0
2017-02-03 09:00:20, 2017-02-03 09:00:20, 99, 1, 0
2017-02-03 09:00:23, 2017-02-03 09:00:23, 100, 1, 0
2017-02-03 09:00:34, 2017-02-03 09:00:44, 101, 2, 10
2017-02-03 09:01:07, 2017-02-03 09:01:07, 100, 1, 0

Preferably a fast data.table solution as I have a lot of records. Thanks!

Matt Summersgill · Accepted Answer

I deleted my comment, I understand what you're going for after a second read-through.

This is pretty straightforward using rleid() from data.table

library(data.table)

## Note: store times as POSIXct instead of POSIXlt for drastic performance improvement
time <- c(as.POSIXct("2017-02-03 09:00:01"),
          as.POSIXct("2017-02-03 09:00:03"),
          as.POSIXct("2017-02-03 09:00:06"),
          as.POSIXct("2017-02-03 09:00:09"),
          as.POSIXct("2017-02-03 09:00:10"),
          as.POSIXct("2017-02-03 09:00:20"),
          as.POSIXct("2017-02-03 09:00:23"),
          as.POSIXct("2017-02-03 09:00:34"),
          as.POSIXct("2017-02-03 09:00:44"),
          as.POSIXct("2017-02-03 09:01:07"))

price <- c(100, 100, 100, 99, 98, 99, 100, 101, 101, 100)

data <- data.frame(time, price)

## Convert to a data.table
setDT(data)

## Create a summary using a generated counter on the fly with 
## the `rleid` function from data.table to group consecutive
## sequences together and then operate by group. the `.N`
## operator is another special symbol in data.table
## that we can use to return the number of rows in each group
## here. See ?special-symbols to learn more
Summary <- data[, .(start = first(time),
                    end = last(time),
                    nbr_records = .N,
                    duration_sec = as.numeric(last(time)) - as.numeric(first(time))
                    ), by = .(Counter = data.table::rleid(price))]

## Drop the Counter variable assuming you don't need it
Summary[,Counter := NULL]

## Results
print(Summary)

#                  start                 end nbr_records duration_sec
# 1: 2017-02-03 09:00:01 2017-02-03 09:00:06           3            5
# 2: 2017-02-03 09:00:09 2017-02-03 09:00:09           1            0
# 3: 2017-02-03 09:00:10 2017-02-03 09:00:10           1            0
# 4: 2017-02-03 09:00:20 2017-02-03 09:00:20           1            0
# 5: 2017-02-03 09:00:23 2017-02-03 09:00:23           1            0
# 6: 2017-02-03 09:00:34 2017-02-03 09:00:44           2           10
# 7: 2017-02-03 09:01:07 2017-02-03 09:01:07           1            0

Running count records in time series according to a condition

Answers (1)

Related Questions