Reputation: 814
I have an irregular time series of events (posts) stored in an xts object, and I want to calculate the number of events that occur over a rolling weekly window (or biweekly, 3-day, etc.). The data looks like this:
postid
2010-08-04 22:28:07 867
2010-08-04 23:31:12 891
2010-08-04 23:58:05 901
2010-08-05 08:35:50 991
2010-08-05 13:28:02 1085
2010-08-05 14:14:47 1114
2010-08-05 14:21:46 1117
2010-08-05 15:46:24 1151
2010-08-05 16:25:29 1174
2010-08-05 23:19:29 1268
2010-08-06 12:15:42 1384
2010-08-06 15:22:06 1403
2010-08-07 10:25:49 1550
2010-08-07 18:58:16 1596
2010-08-07 21:15:44 1608
which should produce something like
nposts
2010-08-05 00:00:00 10
2010-08-06 00:00:00 9
2010-08-07 00:00:00 5
for a 2-day window. I have looked into rollapply, apply.rolling from PerformanceAnalytics, etc., and they all assume regular time series data. I tried truncating the timestamps to just the day each post occurred and using something like ddply to group on each day, which gets me close. However, a user might not post every day, so the time series will still be irregular. I could fill in the gaps with 0s, but that might inflate my data a lot, and it's already quite large.
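Roughly what I tried (a sketch; posts stands for my data frame with a POSIXct time column, and the names are illustrative):
library(plyr)
# truncate each timestamp to its calendar day, then count posts per day
posts$day <- as.Date(posts$time)
daily <- ddply(posts, "day", summarise, nposts = length(day))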
What should I do?
Upvotes: 11
Views: 3360
Reputation: 2877
With runner one can apply any R function on rolling windows. What the OP requires is to compute a function (length) on a rolling window, evaluated only at specified time points.
Using runner, one specifies the at argument to indicate the time points at which the output should be calculated. We can just pass a vector of time points to runner, which we create separately as a POSIXct sequence.
To make runner time-dependent, one has to specify idx with the dates corresponding to the x object. The length of the window can be set with k = "2 days".
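The code below assumes the question's data in a plain data frame x with datetime and postid columns; for reference, a minimal setup (this snippet is not part of the original answer):
x <- data.frame(
  datetime = as.POSIXct(c(
    "2010-08-04 22:28:07", "2010-08-04 23:31:12", "2010-08-04 23:58:05",
    "2010-08-05 08:35:50", "2010-08-05 13:28:02", "2010-08-05 14:14:47",
    "2010-08-05 14:21:46", "2010-08-05 15:46:24", "2010-08-05 16:25:29",
    "2010-08-05 23:19:29", "2010-08-06 12:15:42", "2010-08-06 15:22:06",
    "2010-08-07 10:25:49", "2010-08-07 18:58:16", "2010-08-07 21:15:44")),
  postid = c(867, 891, 901, 991, 1085, 1114, 1117, 1151, 1174,
             1268, 1384, 1403, 1550, 1596, 1608)
)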
at <- seq(as.POSIXct("2010-08-05 00:00:00"),
          by = "1 days",
          length.out = 4)
# [1] "2010-08-05 CEST" "2010-08-06 CEST" "2010-08-07 CEST" "2010-08-08 CEST"

runner::runner(
  x = x$postid,
  k = "2 days",
  idx = x$datetime,
  at = at,
  f = length
)
# [1]  3 10  9  5
Upvotes: 0
Reputation: 176728
Here's a solution using xts:
library(xts)
x <- structure(c(867L, 891L, 901L, 991L, 1085L, 1114L, 1117L, 1151L,
1174L, 1268L, 1384L, 1403L, 1550L, 1596L, 1608L), .Dim = c(15L, 1L),
index = structure(c(1280960887, 1280964672, 1280966285,
1280997350, 1281014882, 1281017687, 1281018106, 1281023184, 1281025529,
1281050369, 1281096942, 1281108126, 1281176749, 1281207496, 1281215744),
tzone = "", tclass = c("POSIXct", "POSIXt")), class = c("xts", "zoo"),
.indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct", "POSIXt"),
.indexTZ = "", tzone = "")
# first count the number of observations each day
xd <- apply.daily(x, length)
# now sum the counts over a 2-day rolling window
x2d <- rollapply(xd, 2, sum)
# align times at the end of the period (if you want)
y <- align.time(x2d, n=60*60*24) # n is in seconds
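With the sample data above, the daily counts in xd are 3, 7, 2 and 3, so the two-day rolling sums work out to 10, 9 and 5, matching the desired output in the question.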
Upvotes: 5
Reputation: 2210
This seems to work:
# n = number of days
n <- 30
# w = window width. In this example, w = 7 days
w <- 7
# I will simulate some data to illustrate the procedure
data <- rep(1:n, rpois(n, 2))
# Tabulate the number of occurrences per day
# (use factor() to make sure days with zero observations are included):
date.table <- table(factor(data, levels=1:n))
# Build an n x n matrix with ones on the main diagonal and on the
# w-1 sub-diagonals, so that column j picks out days j..j+w-1
mat <- diag(n)
for (i in 2:w){
  dim <- n+i-1
  # dropping rows/columns shifts the ones onto the (i-1)-th sub-diagonal
  mat <- mat + diag(dim)[-((n+1):dim),-(1:(i-1))]
}
# And the answer is....
# (despite the name, this is a rolling sum of the daily counts over w days)
roll.mean.7days <- date.table %*% mat
This seems to be not too slow (although the mat matrix will have dimensions n*n). I tried replacing n=30 with n=3000 (which creates a matrix of 9 million elements = 72 MB) and it was still reasonably fast on my computer. For very big data sets, try on a subset first... It will also be faster to use some of the functions in the Matrix package (bandSparse) to create the mat matrix, as sketched below.
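A minimal sketch of that sparse variant, assuming the same n, w and date.table as above (mat.sparse and roll.sum.sparse are illustrative names, not from the answer):
library(Matrix)
# band matrix with ones on the main diagonal and the w-1 sub-diagonals,
# equivalent to the dense mat built in the loop above
mat.sparse <- bandSparse(n, n, k = -(0:(w-1)),
                         diagonals = lapply(0:(w-1), function(j) rep(1, n - j)))
# coerce the table to plain numeric before multiplying by the sparse matrix
roll.sum.sparse <- as.vector(as.numeric(date.table) %*% mat.sparse)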
Upvotes: 4