Haroon Lone

Reputation: 2949

Format historical data for forecasting with calendar variables

I have hourly time series data for the year 2015. The data is the power consumption of a big commercial building, and I want to use it to predict the usage for the year 2016. To develop a forecasting model, I first need to arrange this data in a suitable format.

I am planning to use the following features to predict the 2016 usage: (1) day of week, (2) time of day, (3) temperature, (4) year 2015 usage. I am able to create the first three features, but the fourth one seems tricky.

How should I arrange the 2015 data so that, for a particular day of 2016, I can use the data from the corresponding day of 2015? My concern is described in the example further down.

Here, I have created dummy data corresponding to the years 2015 and 2016.

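library(xts)
set.seed(123)
# hourly index covering all of 2015, including the last day; UTC keeps it free of daylight-saving shifts
seq1 <- seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"), as.POSIXct("2015-12-31 23:00:00", tz = "UTC"), by = "hour")
data1 <- xts(rnorm(length(seq1), 150, 5), seq1)
# hourly index from 2016-01-01 through the end of 2016-09-30
seq2 <- seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"), as.POSIXct("2016-09-30 23:00:00", tz = "UTC"), by = "hour")
data2 <- xts(rnorm(length(seq2), 140, 5), seq2)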

Let me give an example to clarify my problem:

  1. Suppose the model is: lm(output ~ dayofweek + timeofday + temperature + lastyearusage, data = xxx)
  2. Now suppose I want to predict the usage on 2 Oct 2016 (dayY) using the lastyearusage from 2 Oct 2015 (dayX). The issue in this step: how do I ensure that dayX is not a weekend day if dayY is a working day? I am sure that if I use dayX to predict dayY without checking the day type, the output will get messy.

Upvotes: 3

Views: 202

Answers (1)

agenis

Reputation: 8377

There might already be a function in a package to do this, but here is a custom function that adds all these kinds of calendar variables (including the weekend flag) to a data.frame containing a date/hour column. Fake data:

df <- data.frame(datetime=seq(as.POSIXlt("2013/01/01 00:00:00"), as.POSIXlt("2013/12/31 23:00:00"), by="hour"), variable=rnorm(8760))
####              datetime    variable
#### 1 2013-01-01 00:00:00  1.68959052
#### 2 2013-01-01 01:00:00  0.02023722
#### 3 2013-01-01 02:00:00 -0.42080942

The code for the function:

CreateCalendarVariables <- function(df, id_column = NULL) {
  # id_column: index (or name) of the column holding the Date/POSIXt values
  df <- data.frame(df)
  if (is.null(id_column)) stop("Id column for the datetime variable is a mandatory argument")
  temp <- df[, id_column]
  if (!inherits(temp, c("Date", "POSIXt"))) {
    stop("the indicated datetime variable doesn't have the suitable format")
  }
  require(lubridate)
  df['year']        <- year(temp)
  df['.quarter']    <- quarter(temp)
  df['.month']      <- month(temp)
  df['.week']       <- week(temp)
  # format() keeps the column's own time zone; a bare as.Date() on POSIXct
  # converts to UTC first and can return the previous day
  df['.DMY']        <- as.Date(format(temp, "%Y-%m-%d"))
  df['.dayinyear']  <- yday(temp)
  df['.dayinmonth'] <- mday(temp)
  # reorder the levels so the week starts on Monday (written without %>%, so no pipe package is needed)
  wd <- wday(temp, label = TRUE, abbr = FALSE)
  df['.weekday']    <- factor(wd, levels = levels(wd)[c(2, 3, 4, 5, 6, 7, 1)])
  df['.is_we']      <- df$.weekday %in% c("Saturday", "Sunday")  # assumes English day labels
  if (!inherits(temp, "Date")) {
    df['.hour']     <- factor(hour(temp))
  }
  return(df)
}

Then you just have to pass the index of the column that contains the datetime values. If your model needs more of these variables as factors, feel free to adapt the code:

CreateCalendarVariables(df, 2)
#### Error in CreateCalendarVariables(df, 2) : 
####   the indicated datetime variable doesn't have the suitable format
CreateCalendarVariables(df, 1)
####              datetime    variable year .quarter .month .week       .DMY .dayinyear .dayinmonth .weekday .is_we .hour
#### 1 2013-01-01 00:00:00  1.68959052 2013        1      1     1 2013-01-01          1           1  Tuesday  FALSE     0
#### 2 2013-01-01 01:00:00  0.02023722 2013        1      1     1 2013-01-01          1           1  Tuesday  FALSE     1
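
As for the fourth feature (last year's usage), one possible approach is to look up the value from 364 days earlier, i.e. exactly 52 weeks, instead of the same calendar date, so that the weekday and the hour always match. For example, 2 Oct 2016 is a Sunday while 2 Oct 2015 is a Friday, but 4 Oct 2015 is a Sunday. This is only a sketch under assumptions: the helper name add_last_year_usage and the column names datetime and usage are made up for the illustration, the timestamps are in UTC, and public holidays are not handled.

```r
# Hypothetical helper (not part of the function above): attach the usage
# observed exactly 52 weeks (364 days) before each target timestamp.
add_last_year_usage <- function(target, history) {
  # target, history: data.frames with a POSIXct `datetime` column;
  # history also needs a numeric `usage` column.
  lookup_time <- target$datetime - as.difftime(364, units = "days")
  idx <- match(as.numeric(lookup_time), as.numeric(history$datetime))
  target$matched_time  <- history$datetime[idx]  # NA where no match exists
  target$lastyearusage <- history$usage[idx]
  target
}

# Dummy hourly history for 2015 (UTC, so there are no daylight-saving shifts)
set.seed(123)
history <- data.frame(
  datetime = seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2015-12-31 23:00:00", tz = "UTC"), by = "hour"))
history$usage <- rnorm(nrow(history), 150, 5)

# The day to predict: 2 Oct 2016
target <- data.frame(
  datetime = seq(as.POSIXct("2016-10-02 00:00:00", tz = "UTC"),
                 as.POSIXct("2016-10-02 23:00:00", tz = "UTC"), by = "hour"))
target <- add_last_year_usage(target, history)

format(target$matched_time[1], "%Y-%m-%d")          # "2015-10-04"
as.POSIXlt(target$datetime[1], tz = "UTC")$wday      # 0 (Sunday)
as.POSIXlt(target$matched_time[1], tz = "UTC")$wday  # 0 (Sunday)
```

The resulting lastyearusage column can then go into the lm() formula next to the calendar variables. With only one year of history, the first days of a year have no value 364 days earlier, so those rows stay NA.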

To answer your last question: if an entire level is missing from the calibration dataset (e.g. one whole week is absent and you're using .week as a predictor), you'll need to impute the data first.
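
A quick way to find out whether this applies is to compare the levels present in the calibration data with those in the period to predict. The sketch below uses made-up data with week 10 deliberately removed; the week number is computed in base R with the same definition lubridate's week() uses (complete 7-day blocks since 1 January, plus one), and the object names are only illustrative.

```r
# Dummy calibration data for 2015 with one whole week removed on purpose
calib <- data.frame(
  datetime = seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2015-12-31 23:00:00", tz = "UTC"), by = "hour"))
calib$.week <- as.POSIXlt(calib$datetime)$yday %/% 7 + 1  # yday is 0-based
calib <- calib[calib$.week != 10, , drop = FALSE]

# Period to predict
newdata <- data.frame(
  datetime = seq(as.POSIXct("2016-01-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2016-09-30 23:00:00", tz = "UTC"), by = "hour"))
newdata$.week <- as.POSIXlt(newdata$datetime)$yday %/% 7 + 1

# Weeks the model would be asked to predict but has never seen
missing_weeks <- setdiff(unique(newdata$.week), unique(calib$.week))
missing_weeks  # 10
```

Any week listed in missing_weeks has to be imputed in the calibration data (or the predictor dropped) before the model is fitted.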

Upvotes: 0
