user3498523
user3498523

Reputation: 370

Take a daily rolling mean of a seven day window for 30 minute sampled data

I would like to take a mean of a 7 day rolling window with 1 day increments of data that is collected at 30 minute intervals. I have tried using data.table with by conditional statement with no success. Any guidane would be greatly appreciated.

# packages
library(data.table)
library(lubridate)

# Set set.seed to have reproducible sampling 
set.seed(42)

# Create some Data
start = ymd_hms("2014-01-01 00:00:00")
end = ymd_hms("2014-12-31 23:59:59")

# Create data with 30 minute intervals.
dat <- data.table(timestamp = seq(start, end, by = "30 min"),
                  sample1 = sample(1:20, 17520, replace = TRUE))

# Create date variable for merging datasets.
dat[, date := as.Date(timestamp)]

# Create data for 7 day window moving window with one day increments.
dat2 <- data.table(start = seq(start, end, by = "1 day"),
                  end = seq(start + days(7), end + days(7), by = "1 day"))

# Create date variable for merging datasets.
dat2[, date := as.Date(start)]

# mergre datasets.
dat <- merge(dat, dat2, by="date")

# Tried 
dat[, .(sample.mean = mean(sample1)), by = .(timestamp >= start & timestamp < end)]
#    timestamp sample.mean
# 1:      TRUE    10.46638

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %in% c(start:end))]
#    timestamp sample.mean
# 1:      TRUE    10.40059
# 2:     FALSE    10.46767
#  Warning messages:
# 1: In start:end :
#  numerical expression has 17520 elements: only the first used
# 2: In start:end :
#   numerical expression has 17520 elements: only the first used

dat[, .(sample.mean = mean(sample1)), by = .(timestamp %between% c(start, end))]
#    timestamp sample.mean
# 1:      TRUE    19.00000
# 2:     FALSE    10.46589

Upvotes: 0

Views: 660

Answers (2)

Frank
Frank

Reputation: 66819

Here's one approach:

library(zoo)
daymeans = dat[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat[daymeans, rmean := i.rmean, on="date"]

This assumes that your data is already sorted by date; if not, use keyby=date instead of by=date. If you don't want to juggle intermediate objects, there is a one-liner:

# Michael Chirico's suggestion from the comments
dat[dat[, mean(sample1), by=date][, rollmean(V1, 7, fill=NA)], rmean := i.V1, on = "date"]

You may need to tweak the arguments to rollmean to fit your particular definition of the window. @eddi suggested that runmean from the caTools library is typically faster than zoo's rollmean and so is probably also worth a look.


Crude benchmark with the OP's example data:

dat2 = copy(dat)

# Michael's answer
system.time({
setkey(dat, date)
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),  
           mean(sample1), nomatch = 0L], by = date]
})

   user  system elapsed 
   0.33    0.00    0.35

# this answer
system.time({
daymeans = dat2[, mean(sample1), by=date][, rmean := rollmean(V1, 7, fill=NA)]
dat2[daymeans, rmean := i.rmean, on="date"]
})

   user  system elapsed 
      0       0       0 

Why it's faster: Here, we're computing 365 means of 48 numbers and then a rolling mean of length 365; which is less computationally costly than making 365 merges to find 48*7 numbers and then taking the mean of the latter.

Upvotes: 2

MichaelChirico
MichaelChirico

Reputation: 34703

I'm not 100% sure I understand your exact parameters, but here's the basic approach:

setkey(dat, date)

#pull the 7 previous days 
dat[ , dat[.(seq(.BY$date - 7L,
                 .BY$date, by = "day")),  
           #nomatch = 0L will exclude any requested dates outside the interval
           mean(sample1), nomatch = 0L], by = date]
#            date       V1
#   1: 2014-01-01 12.31250
#   2: 2014-01-02 10.94792
#   3: 2014-01-03 11.27083
#   4: 2014-01-04 11.10417
#   5: 2014-01-05 10.79167
#  ---                    
# 361: 2014-12-27 10.50260
# 362: 2014-12-28 10.52344
# 363: 2014-12-29 10.05990
# 364: 2014-12-30 10.03906
# 365: 2014-12-31 10.38542

Some possible tinkers:

  • Change 7L to whatever window you'd like; use positive if you want forward-looking averages

  • If you want to go by timestamp, you'll have to adjust the 7L to match whatever units (seconds/minutes/hours/etc)

  • The extreme points of the interval are not technically correct since the window is shorter than requested; exclude nomatch and these points will return as NA

  • Use .(var = mean(sample1)) to name the output column var.

Upvotes: 2

Related Questions