Reputation: 3862
I have a dataset of parking tickets with their start and end times and information on where they were bought (group). I need to perform a time series analysis to create a forecast of where and when tickets will be bought in the future. To do this, I need to convert the data into a time series format, with values for how many tickets are valid at a given time point.
A sample of my data:
library(lubridate)
timeseries <- data.frame(start = c("2016-12-31 20:42:00",
"2016-12-31 21:41:00",
"2016-12-31 21:15:00",
"2016-12-31 17:19:00",
"2016-12-31 21:47:00",
"2016-12-31 16:58:00"),
end = c("2016-12-31 23:07:00",
"2016-12-31 23:07:00",
"2016-12-31 23:08:00",
"2016-12-31 23:09:00",
"2016-12-31 23:11:00",
"2016-12-31 23:11:00"),
group = c(1,2,1,2,1,2),
stringsAsFactors = FALSE)
timeseries$start <- as.POSIXlt(timeseries$start)
timeseries$end <- as.POSIXlt(timeseries$end)
timeseries$interval <- interval(timeseries$start, timeseries$end, tzone="UTC")
Examples of timeslots I want to aggregate information in (by group):
summary_hours <- data.frame(timeStart = c("2016-12-31 16:00",
"2016-12-31 17:00",
"2016-12-31 18:00",
"2016-12-31 19:00",
"2016-12-31 20:00",
"2016-12-31 21:00",
"2016-12-31 22:00",
"2016-12-31 23:00"),
timeEnd = c("2016-12-31 17:00",
"2016-12-31 18:00",
"2016-12-31 19:00",
"2016-12-31 20:00",
"2016-12-31 21:00",
"2016-12-31 22:00",
"2016-12-31 23:00",
"2017-01-01 00:00"))
summary_hours$timeStart <- as.POSIXlt(summary_hours$timeStart)
summary_hours$timeEnd <- as.POSIXlt(summary_hours$timeEnd)
summary_hours$interval <- interval(summary_hours$timeStart, summary_hours$timeEnd, tzone="UTC")
My current approach, which seems very inefficient when the dataset spans two years:
library("lubridate")
intersect_in_mins <- function(interval) {
return(as.period(intersect(interval, summary_hours$interval), "minutes")@minute)
}
summary_hours$group1 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 1)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours$group2 <- rowSums(t(do.call(rbind, lapply(subset(timeseries, group == 2)$interval, intersect_in_mins))), na.rm = TRUE)
summary_hours
timeStart timeEnd interval group1 group2
1 2016-12-31 16:00:00 2016-12-31 17:00:00 2016-12-31 16:00:00 UTC--2016-12-31 17:00:00 UTC 0 2
2 2016-12-31 17:00:00 2016-12-31 18:00:00 2016-12-31 17:00:00 UTC--2016-12-31 18:00:00 UTC 0 101
3 2016-12-31 18:00:00 2016-12-31 19:00:00 2016-12-31 18:00:00 UTC--2016-12-31 19:00:00 UTC 0 120
4 2016-12-31 19:00:00 2016-12-31 20:00:00 2016-12-31 19:00:00 UTC--2016-12-31 20:00:00 UTC 0 120
5 2016-12-31 20:00:00 2016-12-31 21:00:00 2016-12-31 20:00:00 UTC--2016-12-31 21:00:00 UTC 18 120
6 2016-12-31 21:00:00 2016-12-31 22:00:00 2016-12-31 21:00:00 UTC--2016-12-31 22:00:00 UTC 118 139
7 2016-12-31 22:00:00 2016-12-31 23:00:00 2016-12-31 22:00:00 UTC--2016-12-31 23:00:00 UTC 180 180
8 2016-12-31 23:00:00 2017-01-01 00:00:00 2016-12-31 23:00:00 UTC--2017-01-01 00:00:00 UTC 26 27
Do you have any suggestions of nice libraries that can do this kind of magic automatically?
Upvotes: 1
Views: 630
Reputation: 42582
In the comments, the OP has changed the objective of the question. The request is now to aggregate "minutes of active tickets" for each one-hour time interval.
This requires a completely different approach, which justifies posting a separate answer, IMHO.
To check which tickets are active in which one-hour time intervals, the foverlaps() function from the data.table package can be used:
library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# create sequence of intervals of one hour covering all given times
hours_seq <- timeseries[, {
tmp <- seq(lubridate::floor_date(min(start, end), "hour"),
lubridate::ceiling_date(max(start, end), "hour"),
by = "1 hour")
.(start = head(tmp, -1L), end = tail(tmp, -1L))
}]
hours_seq
                 start                 end
1: 2016-12-31 16:00:00 2016-12-31 17:00:00
2: 2016-12-31 17:00:00 2016-12-31 18:00:00
3: 2016-12-31 18:00:00 2016-12-31 19:00:00
4: 2016-12-31 19:00:00 2016-12-31 20:00:00
5: 2016-12-31 20:00:00 2016-12-31 21:00:00
6: 2016-12-31 21:00:00 2016-12-31 22:00:00
7: 2016-12-31 22:00:00 2016-12-31 23:00:00
8: 2016-12-31 23:00:00 2017-01-01 00:00:00
# split up given ticket intervals in hour pieces
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
# compute active minutes and aggregate
, .(cnt_active_tickets = .N,
sum_active_minutes = sum(as.integer(
difftime(pmin(end, i.end), pmax(start, i.start), units = "mins")))),
keyby = .(group, interval_start = i.start, interval_end = i.end)]
    group      interval_start        interval_end cnt_active_tickets sum_active_minutes
 1:     1 2016-12-31 20:00:00 2016-12-31 21:00:00                  1                 18
 2:     1 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                118
 3:     1 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
 4:     1 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 26
 5:     2 2016-12-31 16:00:00 2016-12-31 17:00:00                  1                  2
 6:     2 2016-12-31 17:00:00 2016-12-31 18:00:00                  2                101
 7:     2 2016-12-31 18:00:00 2016-12-31 19:00:00                  2                120
 8:     2 2016-12-31 19:00:00 2016-12-31 20:00:00                  2                120
 9:     2 2016-12-31 20:00:00 2016-12-31 21:00:00                  2                120
10:     2 2016-12-31 21:00:00 2016-12-31 22:00:00                  3                139
11:     2 2016-12-31 22:00:00 2016-12-31 23:00:00                  3                180
12:     2 2016-12-31 23:00:00 2017-01-01 00:00:00                  3                 27
Note that this approach also considers "short-term parkers", i.e., tickets which are active for less than an hour and start after the full hour and end before the next full hour.
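The clipping step, pmin() of the two end times minus pmax() of the two start times, can be checked in isolation with base R. The following is only a minimal sketch: the helper name overlap_minutes() is hypothetical and not part of data.table, and the clamp to zero is an added assumption for slots that do not overlap the ticket at all (foverlaps() with nomatch = 0L never produces such pairs).

```r
# Hypothetical helper (not from any package): minutes of overlap between
# a ticket [start, end] and one or more time slots [slot_start, slot_end].
overlap_minutes <- function(start, end, slot_start, slot_end) {
  mins <- as.numeric(difftime(pmin(end, slot_end),
                              pmax(start, slot_start),
                              units = "mins"))
  # assumption: a slot without any overlap gives a negative value, clamp to 0
  pmax(mins, 0)
}

# first ticket of the sample data
ticket_start <- as.POSIXct("2016-12-31 20:42:00", tz = "UTC")
ticket_end   <- as.POSIXct("2016-12-31 23:07:00", tz = "UTC")

# four one-hour slots starting at 20:00, 21:00, 22:00 and 23:00
slot_start <- as.POSIXct("2016-12-31 20:00:00", tz = "UTC") + 3600 * 0:3
slot_end   <- slot_start + 3600

active <- overlap_minutes(ticket_start, ticket_end, slot_start, slot_end)
active
# [1] 18 60 60  7
```

The 18 minutes in the first slot match the group 1 value for 20:00 to 21:00 in the result above, where this ticket is the only active one.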
If the result should be presented with the values for each group side by side, the data can be reshaped from long to wide format using dcast():
foverlaps(hours_seq, setkey(timeseries, start, end), nomatch = 0L)[
, active_minutes := as.integer(
difftime(pmin(end, i.end), pmax(start, i.start), units = "mins"))][
, dcast(.SD, i.start + i.end ~ paste0("group", group), sum, value.var = "active_minutes")]
               i.start               i.end group1 group2
1: 2016-12-31 16:00:00 2016-12-31 17:00:00      0      2
2: 2016-12-31 17:00:00 2016-12-31 18:00:00      0    101
3: 2016-12-31 18:00:00 2016-12-31 19:00:00      0    120
4: 2016-12-31 19:00:00 2016-12-31 20:00:00      0    120
5: 2016-12-31 20:00:00 2016-12-31 21:00:00     18    120
6: 2016-12-31 21:00:00 2016-12-31 22:00:00    118    139
7: 2016-12-31 22:00:00 2016-12-31 23:00:00    180    180
8: 2016-12-31 23:00:00 2017-01-01 00:00:00     26     27
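The long-to-wide step may be easier to follow on a toy table. This is a minimal sketch, assuming made-up column names (slot, grp) and a few invented rows. It shows how dcast() sums several rows that fall into the same cell and fills combinations without data with sum() of an empty vector, i.e. 0.

```r
library(data.table)

# toy long-format data: one row per (ticket, slot) piece; values are invented
long <- data.table(
  slot           = c("16:00", "17:00", "17:00", "20:00"),
  group          = c(2, 2, 2, 1),
  active_minutes = c(2L, 41L, 60L, 18L)
)

# build the future column names explicitly (assumed naming scheme)
long[, grp := paste0("group", group)]

# one row per slot, one column per group, minutes summed per cell
wide <- dcast(long, slot ~ grp,
              fun.aggregate = sum, value.var = "active_minutes")
wide
```

Naming value.var explicitly avoids the message dcast() prints when it has to guess which column holds the values.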
Upvotes: 3
Reputation: 42582
The OP has requested a count of how many tickets are valid at a given timepoint.
This can be achieved using a non-equi join of the start and end dates with a continuous sequence of fixed hourly timepoints:
library(data.table)
# IMPORTANT for reproducibility in different timezones
Sys.setenv(TZ = "UTC")
# convert timestamps from character to POSIXct
cols <- c("start", "end")
setDT(timeseries)[, (cols) := lapply(.SD, fasttime::fastPOSIXct), .SDcols = cols]
# add id to each row (required to count the active tickets later)
timeseries[, rn := .I]
# print data for illustration
timeseries[order(group, start, end)]
                 start                 end group rn
1: 2016-12-31 20:42:00 2016-12-31 23:07:00     1  1
2: 2016-12-31 21:15:00 2016-12-31 23:08:00     1  3
3: 2016-12-31 21:47:00 2016-12-31 23:11:00     1  5
4: 2016-12-31 16:58:00 2016-12-31 23:11:00     2  6
5: 2016-12-31 17:19:00 2016-12-31 23:09:00     2  4
6: 2016-12-31 21:41:00 2016-12-31 23:07:00     2  2
# create sequence of hourly timepoints
hours_seq <- timeseries[, seq(lubridate::floor_date(min(start, end), "hour"),
lubridate::ceiling_date(max(start, end), "hour"),
by = "1 hour")]
hours_seq
[1] "2016-12-31 16:00:00 UTC" "2016-12-31 17:00:00 UTC" "2016-12-31 18:00:00 UTC" "2016-12-31 19:00:00 UTC"
[5] "2016-12-31 20:00:00 UTC" "2016-12-31 21:00:00 UTC" "2016-12-31 22:00:00 UTC" "2016-12-31 23:00:00 UTC"
[9] "2017-01-01 00:00:00 UTC"
# non-equi join
timeseries[.(hr = hours_seq), on = .(start <= hr, end > hr), nomatch = 0L,
allow.cartesian = TRUE][
# count number of active tickets at timepoint and by group
, .(n.active.tickets = uniqueN(rn)), keyby = .(group, timepoint = start)]
    group           timepoint n.active.tickets
 1:     1 2016-12-31 21:00:00                1
 2:     1 2016-12-31 22:00:00                3
 3:     1 2016-12-31 23:00:00                3
 4:     2 2016-12-31 17:00:00                1
 5:     2 2016-12-31 18:00:00                2
 6:     2 2016-12-31 19:00:00                2
 7:     2 2016-12-31 20:00:00                2
 8:     2 2016-12-31 21:00:00                2
 9:     2 2016-12-31 22:00:00                3
10:     2 2016-12-31 23:00:00                3
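The join condition start <= hr, end > hr treats each ticket as a half-open interval: it counts at its start time but no longer at its end time. As a rough cross-check, the same condition can be written as a plain loop in base R for the three group 1 tickets. This is a sketch for illustration only, not a replacement for the join, and the variable names are made up.

```r
# the three group 1 tickets from the sample data
starts <- as.POSIXct(c("2016-12-31 20:42:00",
                       "2016-12-31 21:15:00",
                       "2016-12-31 21:47:00"), tz = "UTC")
ends   <- as.POSIXct(c("2016-12-31 23:07:00",
                       "2016-12-31 23:08:00",
                       "2016-12-31 23:11:00"), tz = "UTC")

# hourly timepoints from 20:00 to midnight
timepoints <- as.POSIXct("2016-12-31 20:00:00", tz = "UTC") + 3600 * 0:4

# a ticket is active at timepoint t if start <= t and end > t
# (iterate over indices so that the POSIXct class is kept)
n_active <- vapply(seq_along(timepoints), function(i) {
  sum(starts <= timepoints[i] & ends > timepoints[i])
}, integer(1))

n_active
# [1] 0 1 3 3 0
```

The counts for 21:00, 22:00 and 23:00 agree with the group 1 rows of the result above. Timepoints with a count of zero simply do not appear in the join result because of nomatch = 0L.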
Upvotes: 2