Take a daily rolling mean of a seven day window for 30 minute sampled data

Question

I would like to take the mean over a 7-day rolling window, advancing in 1-day increments, of data collected at 30-minute intervals. I have tried using data.table with a conditional statement in by, with no success. Any guidance would be greatly appreciated.

# packages
library(data.table)
library(lubridate)

# Set the seed so the sampling is reproducible
set.seed(42)

# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")

# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
                  sample1 = sample(1:20, 17520, replace = TRUE))

# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]

# Create data for a 7-day moving window with one-day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
                  end = seq(start + days(7), end + days(7), by = "1 day"))

# Create date variable for merging datasets.
dat2[, date := as.Date(start)]

# Merge datasets.
dat <- merge(dat, dat2, by="date")

# Tried 
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
#    timestamp sample.mean
# 1:      TRUE    10.46638

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
#    timestamp sample.mean
# 1:      TRUE    10.40059
# 2:     FALSE    10.46767
#  Warning messages:
# 1: In start:end :
#  numerical expression has 17520 elements: only the first used
# 2: In start:end :
#   numerical expression has 17520 elements: only the first used

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
#    timestamp sample.mean
# 1:      TRUE    19.00000
# 2:     FALSE    10.46589

MichaelChirico · Accepted Answer

I'm not 100% sure I understand your exact parameters, but here's the basic approach:

setkey(dat, date)

#pull the 7 previous days plus the current day (8 dates in total)
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),  
           #nomatch = 0L will exclude any requested dates outside the interval
           mean(sample1), nomatch = 0L], by = date]
#            date       V1
#   1: 2014-01-01 12.31250
#   2: 2014-01-02 10.94792
#   3: 2014-01-03 11.27083
#   4: 2014-01-04 11.10417
#   5: 2014-01-05 10.79167
#  ---                    
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542
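
To see what the inner join does for a single group, here is a minimal sketch on a small stand-in table. The names toy, target, wanted and window_rows are made up for illustration and are not part of the question's data; the window arithmetic is the same as above.

```r
library(data.table)

# Hypothetical stand-in data: 10 days of 30-minute samples (48 per day)
set.seed(1)
toy <- data.table(
  timestamp = seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                  by = "30 min", length.out = 10 * 48)
)
toy[, sample1 := sample(1:20, .N, replace = TRUE)]
toy[, date := as.Date(timestamp)]
setkey(toy, date)

# The dates the inner query asks for when the group's date is 2014-01-03
target <- as.Date("2014-01-03")
wanted <- seq(target - 7L, target, by = "day")

# Keyed join: keeps every row whose date is in `wanted`;
# nomatch = 0L drops the requested dates that have no rows (those in 2013)
window_rows <- toy[.(wanted), nomatch = 0L]

window_rows[, uniqueN(date)]   # number of distinct days actually found
window_rows[, mean(sample1)]   # the value the outer query reports for this date
```

The outer by = date simply repeats this lookup once per date, with .BY$date playing the role of target.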

Some possible tinkers:

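Change 7L to whatever window you'd like. Note that seq(.BY$date - 7L, .BY$date, by = "day") covers 8 calendar days, so use 6L for a strict 7-day window. For forward-looking averages, swap the endpoints as well: seq(.BY$date, .BY$date + 7L, by = "day").
If you want to go by timestamp instead of date, you'll have to express the 7L in the units of the key (POSIXct arithmetic is in seconds, so 7 days is 7 * 86400) and change by in seq to match the sampling interval.
The dates at the edges of the data are not technically correct, since their window is shorter than requested; omit nomatch = 0L and these dates will return NA instead.
Use .(var = mean(sample1)) to name the output column var.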
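
A rough sketch of these variations, assuming a smaller stand-in for the question's table (30 days instead of a full year). The column name sample.mean and the result names back, fwd and strict are just illustrative choices.

```r
library(data.table)

# Assumed stand-in for the question's data: 30 days of 30-minute samples
set.seed(42)
dat <- data.table(
  timestamp = seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
                  by = "30 min", length.out = 30 * 48)
)
dat[, sample1 := sample(1:20, .N, replace = TRUE)]
dat[, date := as.Date(timestamp)]
setkey(dat, date)

# Backward-looking, strict 7-day window (current day plus the 6 before it),
# with a named output column
back <- dat[, dat[.(seq(.BY$date - 6L, .BY$date, by = "day")),
                  .(sample.mean = mean(sample1)), nomatch = 0L],
            by = date]

# Forward-looking window: current day plus the 6 after it
fwd <- dat[, dat[.(seq(.BY$date, .BY$date + 6L, by = "day")),
                 .(sample.mean = mean(sample1)), nomatch = 0L],
           by = date]

# Without nomatch = 0L, requested dates with no rows come back as NA rows,
# so any incomplete window gets an NA mean
strict <- dat[, dat[.(seq(.BY$date - 6L, .BY$date, by = "day")),
                    .(sample.mean = mean(sample1))],
              by = date]
```

With this setup, back and strict agree wherever the full 7 days exist, and strict is NA for the first 6 dates.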
