Reputation: 370
I would like to take a mean of a 7 day rolling window with 1 day increments of data that is collected at 30 minute intervals.
I have tried using data.table
with by
conditional statement with no success. Any guidane would be greatly appreciated.
# packages
library(data.table)
library(lubridate)
# Set set.seed to have reproducible sampling
set.seed(42)
# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")
# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
sample1 = sample(1:20, 17520, replace = TRUE))
# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]
# Create data for 7 day window moving window with one day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
end = seq(start + days(7), end + days(7), by = "1 day"))
# Create date variable for merging datasets.
dat2[, date := as.Date(start)]
# mergre datasets.
dat <- merge(dat, dat2, by="date")
# Tried
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
# timestamp sample.mean
# 1: TRUE 10.46638
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
# timestamp sample.mean
# 1: TRUE 10.40059
# 2: FALSE 10.46767
# Warning messages:
# 1: In start:end :
# numerical expression has 17520 elements: only the first used
# 2: In start:end :
# numerical expression has 17520 elements: only the first used
dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
# timestamp sample.mean
# 1: TRUE 19.00000
# 2: FALSE 10.46589
Upvotes: 0
Views: 660
Reputation: 66819
Here's one approach:
library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]
This assumes that your data is already sorted by date
; if not, use keyby=date
instead of by=date
. If you don't want to juggle intermediate objects, there is a one-liner:
# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]
You may need to tweak the arguments to rollmean
to fit your particular definition of the window. @eddi suggested that runmean
from the caTools library is typically faster than zoo's rollmean
and so is probably also worth a look.
Crude benchmark with the OP's example data:
dat2 = copy(dat)
# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
mean(sample1), nomatch = 0L], by = date]
})
user system elapsed
0.33 0.00 0.35
# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})
user system elapsed
0 0 0
Why it's faster: Here, we're computing 365 means of 48 numbers and then a rolling mean of length 365; which is less computationally costly than making 365 merges to find 48*7 numbers and then taking the mean of the latter.
Upvotes: 2
Reputation: 34703
I'm not 100% sure I understand your exact parameters, but here's the basic approach:
setkey(dat, date)
#pull the 7 previous days
dat[ , dat[.(seq(.BY$date - 7L,
.BY$date, by = "day")),
#nomatch = 0L will exclude any requested dates outside the interval
mean(sample1), nomatch = 0L], by = date]
# date V1
# 1: 2014-01-01 12.31250
# 2: 2014-01-02 10.94792
# 3: 2014-01-03 11.27083
# 4: 2014-01-04 11.10417
# 5: 2014-01-05 10.79167
# ---
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
Some possible tinkers:
Change 7L
to whatever window you'd like; use positive if you want forward-looking averages
If you want to go by
timestamp
, you'll have to adjust the 7L
to match whatever units (seconds/minutes/hours/etc)
The extreme points of the interval are not technically correct since the window is shorter than requested; exclude nomatch
and these points will return as NA
Use .(var = mean(sample1))
to name the output column var
.
Upvotes: 2