Esben Eickhardt

Reputation: 3862

Aggregate timeseries intervals by hour

I have a dataset of parking tickets with their start and end times and information on where they were bought (group). I need to perform a time series analysis to forecast where and when tickets will be bought in the future. To do this, I need to convert the data into a time series format that gives the number of tickets valid at a given timepoint.

A sample of my data:

library(lubridate)
timeseries <- data.frame(start = c("2016-12-31 20:42:00",
                                   "2016-12-31 21:41:00",
                                   "2016-12-31 21:15:00",
                                   "2016-12-31 17:19:00",
                                   "2016-12-31 21:47:00",
                                   "2016-12-31 16:58:00"),
                         end = c("2016-12-31 23:07:00",
                                 "2016-12-31 23:07:00",
                                 "2016-12-31 23:08:00",
                                 "2016-12-31 23:09:00",
                                 "2016-12-31 23:11:00",
                                 "2016-12-31 23:11:00"),
                         group = c(1,2,1,2,1,2),
                         stringsAsFactors = FALSE)
timeseries$start <- as.POSIXlt(timeseries$start)
timeseries$end <- as.POSIXlt(timeseries$end)
timeseries$interval <- interval(timeseries$start, timeseries$end, tzone="UTC")

Examples of timeslots I want to aggregate information in (by group):

summary_hours <- data.frame(timeStart = c("2016-12-31 16:00",
                                          "2016-12-31 17:00",
                                          "2016-12-31 18:00",
                                          "2016-12-31 19:00",
                                          "2016-12-31 20:00",
                                          "2016-12-31 21:00",
                                          "2016-12-31 22:00",
                                          "2016-12-31 23:00"),
                            timeEnd = c("2016-12-31 17:00",
                                        "2016-12-31 18:00",
                                        "2016-12-31 19:00",
                                        "2016-12-31 20:00",
                                        "2016-12-31 21:00",
                                        "2016-12-31 22:00",
                                        "2016-12-31 23:00",
                                        "2017-01-01 00:00"))
summary_hours$timeStart <- as.POSIXlt(summary_hours$timeStart)
summary_hours$timeEnd <- as.POSIXlt(summary_hours$timeEnd)
summary_hours$interval <- interval(summary_hours$timeStart, summary_hours$timeEnd, tzone="UTC")

My current approach, shown here, seems very inefficient when the dataset spans two years:

library("lubridate")
intersect_in_mins <- function(interval) {
  return(as.period(intersect(interval, summary_hours$interval), "minutes")@minute)
}

summary_hours$group1 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 1)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours$group2 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 2)$interval, intersect_in_mins))), na.rm = TRUE)

summary_hours
            timeStart             timeEnd                                         interval group1 group2
1 2016-12-31 16:00:00 2016-12-31 17:00:00 2016-12-31 16:00:00 UTC--2016-12-31 17:00:00 UTC      0      2
2 2016-12-31 17:00:00 2016-12-31 18:00:00 2016-12-31 17:00:00 UTC--2016-12-31 18:00:00 UTC      0    101
3 2016-12-31 18:00:00 2016-12-31 19:00:00 2016-12-31 18:00:00 UTC--2016-12-31 19:00:00 UTC      0    120
4 2016-12-31 19:00:00 2016-12-31 20:00:00 2016-12-31 19:00:00 UTC--2016-12-31 20:00:00 UTC      0    120
5 2016-12-31 20:00:00 2016-12-31 21:00:00 2016-12-31 20:00:00 UTC--2016-12-31 21:00:00 UTC     18    120
6 2016-12-31 21:00:00 2016-12-31 22:00:00 2016-12-31 21:00:00 UTC--2016-12-31 22:00:00 UTC    118    139
7 2016-12-31 22:00:00 2016-12-31 23:00:00 2016-12-31 22:00:00 UTC--2016-12-31 23:00:00 UTC    180    180
8 2016-12-31 23:00:00 2017-01-01 00:00:00 2016-12-31 23:00:00 UTC--2017-01-01 00:00:00 UTC     26     27

Can you suggest any libraries that do this kind of aggregation automatically?

Upvotes: 1

Views: 630

Answers (2)

Uwe

Reputation: 42582

In his comments, the OP has changed the objective of the question. The request is now to aggregate "minutes of active tickets" for each one-hour time interval.

This requires a completely different approach, which justifies posting a separate answer, IMHO.

To check which tickets are active in which time intervals of one hour, the foverlaps() function from the data.table package can be used:

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]

# create sequence of intervals of one hour covering all given times
hours_seq <- timeseries[, {
  tmp <- seq(lubridate::floor_date(min(start, end), "hour"),
             lubridate::ceiling_date(max(start, end), "hour"), 
             by = "1 hour")
  .(start = head(tmp, -1L), end = tail(tmp, -1L))
  }]
hours_seq
                 start                 end
1: 2016-12-31 16:00:00 2016-12-31 17:00:00
2: 2016-12-31 17:00:00 2016-12-31 18:00:00
3: 2016-12-31 18:00:00 2016-12-31 19:00:00
4: 2016-12-31 19:00:00 2016-12-31 20:00:00
5: 2016-12-31 20:00:00 2016-12-31 21:00:00
6: 2016-12-31 21:00:00 2016-12-31 22:00:00
7: 2016-12-31 22:00:00 2016-12-31 23:00:00
8: 2016-12-31 23:00:00 2017-01-01 00:00:00
# split up given ticket intervals in hour pieces 
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  # compute active minutes and aggregate
  , .(cnt_active_tickets = .N, 
      sum_active_minutes = sum(as.integer(
        difftime(pmin(end, i.end), pmax(start, i.start), units = "mins")))), 
    keyby = .(group, interval_start = i.start, interval_end = i.end)]
    group      interval_start        interval_end cnt_active_tickets sum_active_minutes
 1:     1 2016-12-31 20:00:00 2016-12-31 21:00:00                  1                 18
 2:     1 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                118
 3:     1 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
 4:     1 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 26
 5:     2 2016-12-31 16:00:00 2016-12-31 17:00:00                  1                  2
 6:     2 2016-12-31 17:00:00 2016-12-31 18:00:00                  2                101
 7:     2 2016-12-31 18:00:00 2016-12-31 19:00:00                  2                120
 8:     2 2016-12-31 19:00:00 2016-12-31 20:00:00                  2                120
 9:     2 2016-12-31 20:00:00 2016-12-31 21:00:00                  2                120
10:     2 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                139
11:     2 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
12:     2 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 27

Note that this approach also considers "short-term parkers", i.e., tickets which are active for less than an hour and start after the full hour and end before the next full hour.
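
The clamping arithmetic behind sum_active_minutes can be checked by hand in base R. The following is only a minimal sketch: the helper names overlap_minutes and utc, and the short-term ticket from 21:10 to 21:25, are made up for illustration and do not appear in the question or in data.table.

```r
# Hypothetical helper (not part of data.table or lubridate): minutes of
# overlap between tickets [start, end] and one slot [slot_start, slot_end].
overlap_minutes <- function(start, end, slot_start, slot_end) {
  mins <- as.numeric(
    difftime(pmin(end, slot_end), pmax(start, slot_start), units = "mins"))
  pmax(mins, 0)  # tickets outside the slot give a negative value; clamp to 0
}

# Hypothetical convenience wrapper to parse timestamps as UTC
utc <- function(x) as.POSIXct(x, tz = "UTC")

slot_start <- utc("2016-12-31 21:00:00")
slot_end   <- utc("2016-12-31 22:00:00")

# Made-up short-term ticket lying entirely inside the 21:00 slot
short_ticket <- overlap_minutes(utc("2016-12-31 21:10:00"),
                                utc("2016-12-31 21:25:00"),
                                slot_start, slot_end)

# The three group 1 tickets from the question, restricted to the same slot
group1 <- overlap_minutes(
  utc(c("2016-12-31 20:42:00", "2016-12-31 21:15:00", "2016-12-31 21:47:00")),
  utc(c("2016-12-31 23:07:00", "2016-12-31 23:08:00", "2016-12-31 23:11:00")),
  slot_start, slot_end)

print(short_ticket)  # 15
print(group1)        # 60 45 13
print(sum(group1))   # 118, the group 1 value for the 21:00 slot
```

The short-term ticket contributes its full 15 minutes because both of its ends fall inside the slot, so neither pmin() nor pmax() changes them. The final pmax(mins, 0) is only needed in this sketch because it does not filter out non-overlapping tickets first; in the data.table version, foverlaps() with nomatch = 0L already returns only overlapping pairs.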

Output in wide format

If the result should be presented with the values for each group side by side, the data can be reshaped from long to wide format using dcast():

foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
  , active_minutes := as.integer(
    difftime(pmin(end, i.end), pmax(start, i.start), units = "mins"))][
      , dcast(.SD, i.start + i.end ~ paste0("group", group), sum,
              value.var = "active_minutes")]
               i.start               i.end group1 group2
1: 2016-12-31 16:00:00 2016-12-31 17:00:00      0      2
2: 2016-12-31 17:00:00 2016-12-31 18:00:00      0    101
3: 2016-12-31 18:00:00 2016-12-31 19:00:00      0    120
4: 2016-12-31 19:00:00 2016-12-31 20:00:00      0    120
5: 2016-12-31 20:00:00 2016-12-31 21:00:00     18    120
6: 2016-12-31 21:00:00 2016-12-31 22:00:00    118    139
7: 2016-12-31 22:00:00 2016-12-31 23:00:00    180    180
8: 2016-12-31 23:00:00 2017-01-01 00:00:00     26     27

Upvotes: 3

Uwe

Reputation: 42582

The OP has requested to count how many tickets are valid at a given timepoint.

This can be achieved using a non-equi join of the start and end dates with a continuous sequence of fixed hourly timepoints:

library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")

# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# add id to each row (required to count the active tickets later)
timeseries[, rn := .I]
# print data for illustration
timeseries[order(group, start, end)]
                 start                 end group rn
1: 2016-12-31 20:42:00 2016-12-31 23:07:00     1  1
2: 2016-12-31 21:15:00 2016-12-31 23:08:00     1  3
3: 2016-12-31 21:47:00 2016-12-31 23:11:00     1  5
4: 2016-12-31 16:58:00 2016-12-31 23:11:00     2  6
5: 2016-12-31 17:19:00 2016-12-31 23:09:00     2  4
6: 2016-12-31 21:41:00 2016-12-31 23:07:00     2  2
# create sequence of hourly timepoints
hours_seq <- timeseries[, seq(lubridate::floor_date(min(start, end), "hour"),
                              lubridate::ceiling_date(max(start, end), "hour"), 
                              by = "1 hour")]
hours_seq
[1] "2016-12-31 16:00:00 UTC" "2016-12-31 17:00:00 UTC" "2016-12-31 18:00:00 UTC" "2016-12-31 19:00:00 UTC"
[5] "2016-12-31 20:00:00 UTC" "2016-12-31 21:00:00 UTC" "2016-12-31 22:00:00 UTC" "2016-12-31 23:00:00 UTC"
[9] "2017-01-01 00:00:00 UTC"
# non-equi join
timeseries[.(hr = hours_seq), on = .(start <= hr, end > hr), nomatch = 0L,
           allow.cartesian = TRUE][
             # count number of active tickets at timepoint and by group
             , .(n.active.tickets = uniqueN(rn)), keyby = .(group, timepoint = start)]
    group           timepoint n.active.tickets
 1:     1 2016-12-31 21:00:00                1
 2:     1 2016-12-31 22:00:00                3
 3:     1 2016-12-31 23:00:00                3
 4:     2 2016-12-31 17:00:00                1
 5:     2 2016-12-31 18:00:00                2
 6:     2 2016-12-31 19:00:00                2
 7:     2 2016-12-31 20:00:00                2
 8:     2 2016-12-31 21:00:00                2
 9:     2 2016-12-31 22:00:00                3
10:     2 2016-12-31 23:00:00                3
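
The join condition (start <= hr and end > hr) can be cross-checked for single timepoints in base R. This is only a sketch for verification, not a replacement for the non-equi join: the names tickets, utc and count_active are made up for illustration, and the data are retyped from the question.

```r
# Hypothetical convenience wrapper to parse timestamps as UTC
utc <- function(x) as.POSIXct(x, tz = "UTC")

# Sample tickets from the question
tickets <- data.frame(
  start = utc(c("2016-12-31 20:42:00", "2016-12-31 21:41:00",
                "2016-12-31 21:15:00", "2016-12-31 17:19:00",
                "2016-12-31 21:47:00", "2016-12-31 16:58:00")),
  end   = utc(c("2016-12-31 23:07:00", "2016-12-31 23:07:00",
                "2016-12-31 23:08:00", "2016-12-31 23:09:00",
                "2016-12-31 23:11:00", "2016-12-31 23:11:00")),
  group = c(1, 2, 1, 2, 1, 2)
)

# Hypothetical helper: tickets per group that are valid at timepoint hr,
# i.e. bought at or before hr and expiring after hr.
count_active <- function(hr) {
  active <- tickets$start <= hr & tickets$end > hr
  tapply(active, tickets$group, sum)
}

at_17 <- count_active(utc("2016-12-31 17:00:00"))
at_21 <- count_active(utc("2016-12-31 21:00:00"))
at_22 <- count_active(utc("2016-12-31 22:00:00"))

print(at_21)  # group 1: 1, group 2: 2
print(at_22)  # group 1: 3, group 2: 3
```

Unlike the join with nomatch = 0L, this sketch also reports zero counts (for example group 1 at 17:00), whereas the join result simply omits those rows. It loops over one timepoint at a time, so it is meant for spot checks rather than for two years of data.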

Upvotes: 2
