Reputation: 2949
I have hourly time series data for the year 2015. The data is the power consumption of a large commercial building. I want to use it to predict the usage for the year 2016. To develop a forecasting model, I need to arrange this data in a suitable format.
I am planning to use the following features to predict the 2016 usage: (1) day of week, (2) time of day, (3) temperature, (4) year 2015 usage. I am able to create the first three features, but the fourth one seems tricky.
How should I arrange the 2015 data so that, for a particular day of 2016, I can use the data from the corresponding day of 2015? My concern is:
Here, I have created dummy data corresponding to the years 2015 and 2016.
library(xts)
set.seed(123)
seq1 <- seq(as.POSIXct("2015-01-01"),as.POSIXct("2015-12-31"), by = "hour")
data1 <- xts(rnorm(length(seq1),150,5),seq1)
seq2 <- seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-09-30"), by = "hour")
data2 <- xts(rnorm(length(seq2),140,5),seq2)
Let me give an example to clarify my problem. Suppose I fit a model such as
lm(output ~ dayofweek + timeofday + temperature + lastyearusage, data = xxx)
to predict the usage on a day of 2016 (dayY), using the lastyearusage on 2 Oct 2015 (dayX). The issue at this step is: 1) How should I ensure that dayX is not a weekend day if dayY is a working day? I am sure that if I use dayX to predict dayY without checking the day type, the output will get messy.
Upvotes: 3
Views: 202
Reputation: 8377
There might already be a function in a package to do this, but I post here a custom function that adds all these kinds of calendar variables (including the weekend info) to a data.frame containing a date/hour column. Fake data:
df <- data.frame(datetime=seq(as.POSIXlt("2013/01/01 00:00:00"), as.POSIXlt("2013/12/31 23:00:00"), by="hour"), variable=rnorm(8760))
#### datetime variable
#### 1 2013-01-01 00:00:00 1.68959052
#### 2 2013-01-01 01:00:00 0.02023722
#### 3 2013-01-01 02:00:00 -0.42080942
The code for the function:
CreateCalendarVariables <- function(df, id_column = NULL) {
  # Add calendar variables derived from the datetime column `id_column`
  library(lubridate)  # provides year(), quarter(), month(), week(), wday(), ...
  df <- data.frame(df)
  if (is.null(id_column)) stop("Id column for the datetime variable is a mandatory argument")
  temp <- df[, id_column]
  if (!inherits(temp, c("Date", "POSIXt"))) {
    stop("the indicated datetime variable doesn't have the suitable format")
  }
  df['year'] <- year(temp)
  df['.quarter'] <- quarter(temp)
  df['.month'] <- month(temp)
  df['.week'] <- week(temp)
  # format() keeps the local calendar date; as.Date() on a POSIXct converts
  # via UTC and can shift local midnight to the previous day
  df['.DMY'] <- as.Date(format(temp, "%Y-%m-%d"))
  df['.dayinyear'] <- yday(temp)
  df['.dayinmonth'] <- mday(temp)
  # Reorder the levels so the week starts on Monday (labels depend on the locale)
  wd <- wday(temp, label = TRUE, abbr = FALSE)
  df['.weekday'] <- factor(wd, levels = levels(wd)[c(2:7, 1)])
  df['.is_we'] <- df$.weekday %in% c("Saturday", "Sunday")
  # The hour only exists for datetime (not Date) columns
  if (!inherits(temp, "Date")) {
    df['.hour'] <- factor(hour(temp))
  }
  return(df)
}
Then you just have to specify the index of the column containing the datetime. If your model needs these variables as factors, feel free to adapt the code:
CreateCalendarVariables(df, 2)
#### Error in CreateCalendarVariables(df, 2) :
#### the indicated datetime variable doesn't have the suitable format
CreateCalendarVariables(df, 1)
#### datetime variable year .quarter .month .week .DMY .dayinyear .dayinmonth .weekday .is_we .hour
#### 1 2013-01-01 00:00:00 1.68959052 2013 1 1 1 2013-01-01 1 1 Tuesday FALSE 0
#### 2 2013-01-01 01:00:00 0.02023722 2013 1 1 1 2013-01-01 1 1 Tuesday FALSE 1
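For the lastyearusage feature itself, one option (my assumption, not something the function above does) is to shift each 2016 timestamp back by 52 weeks (364 days) instead of a calendar year, so the reference day always falls on the same weekday and hour. A minimal base-R sketch on dummy data like yours; the names usage2015, usage2016, ref_datetime and lastyearusage are only illustrative, and UTC is used to sidestep daylight-saving gaps:

```r
set.seed(123)
t2015 <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-12-31 23:00", tz = "UTC"), by = "hour")
t2016 <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
             as.POSIXct("2016-09-30 23:00", tz = "UTC"), by = "hour")
usage2015 <- data.frame(datetime = t2015, usage = rnorm(length(t2015), 150, 5))
usage2016 <- data.frame(datetime = t2016, usage = rnorm(length(t2016), 140, 5))

# Shift back 52 weeks: same weekday and same hour, roughly one year earlier
usage2016$ref_datetime <- usage2016$datetime - as.difftime(364, units = "days")

# Look up the 2015 usage at the reference timestamp (NA if there is no match)
idx <- match(as.numeric(usage2016$ref_datetime), as.numeric(usage2015$datetime))
usage2016$lastyearusage <- usage2015$usage[idx]

# Calendar features; "%u" is the ISO weekday number (1 = Monday), locale-independent
usage2016$dayofweek <- factor(format(usage2016$datetime, "%u", tz = "UTC"))
usage2016$timeofday <- factor(format(usage2016$datetime, "%H", tz = "UTC"))

# Temperature is left out because the dummy data has none
fit <- lm(usage ~ dayofweek + timeofday + lastyearusage, data = usage2016)
```

This only aligns the weekday. Public holidays and other special days would still need separate handling, for example with an extra indicator column.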
To answer your last question: if an entire level is missing from the calibration dataset (e.g. one whole week, and you are using .week as a predictor), you'll need to impute the data first.
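As a rough illustration of that last point, the sketch below finds weeks with no observed usage and fills them with the mean of the same weekday and hour from the rest of the year. This is only one naive imputation choice, picked here for brevity; the data, the simulated outage in week 10 and the column names are all made up, and the week number is computed in base R in a way intended to mirror lubridate's week():

```r
set.seed(1)
hours <- seq(as.POSIXct("2015-01-01 00:00", tz = "UTC"),
             as.POSIXct("2015-12-31 23:00", tz = "UTC"), by = "hour")
calib <- data.frame(datetime = hours, usage = rnorm(length(hours), 150, 5))

# Calendar variables: 7-day blocks counted from 1 January, weekday, hour
calib$week <- (as.integer(format(calib$datetime, "%j", tz = "UTC")) - 1) %/% 7 + 1
calib$wday <- format(calib$datetime, "%u", tz = "UTC")
calib$hour <- format(calib$datetime, "%H", tz = "UTC")

# Simulate an outage: week 10 has no measurements at all
calib$usage[calib$week == 10] <- NA

# Weeks for which not a single value was observed
observed_weeks <- unique(calib$week[!is.na(calib$usage)])
missing_weeks <- setdiff(unique(calib$week), observed_weeks)

# Naive imputation: mean usage for the same weekday and hour over the observed data
profile <- ave(calib$usage, calib$wday, calib$hour,
               FUN = function(x) mean(x, na.rm = TRUE))
calib$usage_imputed <- ifelse(is.na(calib$usage), profile, calib$usage)
```

After this step every week has values again, so a model using the week as a factor sees all its levels during calibration.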
Upvotes: 0