Reputation: 347
I'm trying (for the first time) to add an external variable to prophet with the add_regressor
function, but the results I'm getting look wild. The dataset I'm using is freely available on kaggle (the well known shampoo sales) here. I'm attempting to use freely available data for the SPY stock index using R's quantmod
package as my external variable.
Here's how I start the code:
library(prophet)
library(quantmod)
library(dplyr)
df <- read.csv("~/shampoo.csv")
#now get the min and max dates in the column
min_date <- min(df$Date, na.rm = TRUE)
max_date <- max(df$Date, na.rm = TRUE)
#download the SPY stock data
getSymbols("SPY", from = min_date, to = max_date)
#SPY closes stored into a df and massage a bit
Close <- data.frame(Cl(SPY))
Close <- cbind(ds = rownames(Close), Close)
rownames(Close) <- NULL
Close_no_rename <- Close
colnames(Close)[colnames(Close) == 'SPY.Close'] <- 'y'
colnames(Close_no_rename)[colnames(Close_no_rename) == 'SPY.Close'] <- 'SPY_CLOSE'
#now put this into prophet and make a forecast for the forecast_period for SPY
stock_model <- prophet(Close)
#make a forecast dataframe
future_stocks <- make_future_dataframe(stock_model, periods = 30, freq = "month", include_history = FALSE)
#the below df will have predicted stock prices of the SPY. want to extract the future y values as point forecast along with dates
forecast <- predict(stock_model, future_stocks) %>% select(ds, yhat)
colnames(forecast)[colnames(forecast) == 'yhat'] <- 'SPY_CLOSE'
#rename the columns of the actual df
colnames(df)[colnames(df) == 'Date'] <- 'ds'
colnames(df)[colnames(df) == 'Value'] <- 'y'
#now want to merge the Close df y historic values onto the training df, merge by date ds column
df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")
#now actually forecast using prophet
model <- prophet()
#add the SPY regressor
model <- add_regressor(model, 'SPY_CLOSE', prior.scale = 0.0000001, standardize = FALSE)
model <- fit.prophet(model, df_historic_with_SPY_close)
forecast_final <- predict(model, forecast)
plot(model, forecast_final)
This does not throw any errors but the plot of the forecast looks...wrong. It looks as if the scale is off or something. I tried fiddling with the prior and standardize settings with no luck. Thanks for any help!
Here is the shampoo dataset being used as the main variable:
Date Value
2017-01-01 266
2017-02-01 145.9
2017-03-01 183.1
2017-04-01 119.3
2017-05-01 180.3
2017-06-01 168.5
2017-07-01 231.8
2017-08-01 224.5
2017-09-01 192.8
2017-10-01 122.9
2017-11-01 336.5
2017-12-01 185.9
2018-01-01 194.3
2018-02-01 149.5
2018-03-01 210.1
2018-04-01 273.3
2018-05-01 191.4
2018-06-01 287
2018-07-01 226
2018-08-01 303.6
2018-09-01 289.9
2018-10-01 421.6
2018-11-01 264.5
2018-12-01 342
2019-01-01 339.7
2019-02-01 440.4
2019-03-01 315.9
2019-04-01 439.3
2019-05-01 401.3
2019-06-01 437.4
2019-07-01 575.5
2019-08-01 407.6
2019-09-01 682
2019-10-01 475.3
2019-11-01 581.3
2019-12-01 646.9
Upvotes: 3
Views: 1456
Reputation: 877
I think I have fixed things, but the only thing I did differently was change the dates from factor format to date format, and tell R to use the select function from dplyr. I also ran R without any other packages loaded. So, it's still a bit of a mystery why this worked.
I tumbled to this issue when the line
df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")
didn't work properly. I discovered it was because I had formatted the df dates as dates to get them to work with getSymbols
but then they were different from Close_no_rename.
First, the data I used:
df<-dput(df)
structure(list(ds = structure(c(17167, 17198, 17226, 17257, 17287,
17318, 17348, 17379, 17410, 17440, 17471, 17501, 17532, 17563,
17591, 17622, 17652, 17683, 17713, 17744, 17775, 17805, 17836,
17866, 17897, 17928, 17956, 17987, 18017, 18048, 18078, 18109,
18140, 18170, 18201, 18231), class = "Date"), y = c(266, 145.9,
183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5,
185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287, 226, 303.6, 289.9,
421.6, 264.5, 342.3, 339.7, 440.4, 315.9, 439.3, 401.3, 437.4,
575.5, 407.6, 682, 475.3, 581.3, 646.9)), row.names = c(NA, -36L
), class = "data.frame")
library(prophet)
library(quantmod)
library(dplyr)
# can use your df, rather than above
df<-read.csv("~/shampoo.csv")
# either way, should run this
df$Date<-as.Date.factor(df$Date,tryFormats = c("%d-%m-%y"))
str(df) #check
#now get the min and max dates in the column
min_date <- min(df$Date, na.rm = TRUE)
max_date <- max(df$Date, na.rm = TRUE)
#download the SPY stock data
getSymbols("SPY", from = min_date, to = max_date)
#SPY closes stored into a df and massage a bit
Close <- data.frame(Cl(SPY))
Close <- cbind(ds = rownames(Close), Close)
rownames(Close) <- NULL
Close_no_rename <- Close
colnames(Close)[colnames(Close) == 'SPY.Close'] <- 'y'
colnames(Close_no_rename)[colnames(Close_no_rename) == 'SPY.Close'] <- 'SPY_CLOSE'
# make dates in date format
Close_no_rename$ds<-as.Date(Close_no_rename$ds)
str(Close_no_rename)
#now put this into prophet and make a forecast for the forecast_period for SPY
stock_model <- prophet(Close)
#make a forecast dataframe
future_stocks <- make_future_dataframe(stock_model, periods = 30, freq = "month", include_history = FALSE)
#the below df will have predicted stock prices of the SPY. want to extract the future y values as point forecast along with dates
# specify dplyr:::select
forecast <- predict(stock_model, future_stocks) %>% dplyr:::select(ds, yhat)
colnames(forecast)[colnames(forecast) == 'yhat'] <- 'SPY_CLOSE'
#rename the columns of the actual df
colnames(df)[colnames(df) == 'Date'] <- 'ds'
colnames(df)[colnames(df) == 'Value'] <- 'y'
#now want to merge the Close df y historic values onto the training df, merge by date ds column
df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")
df_historic_with_SPY_close
#now actually forecast using prophet
model <- prophet()
#add the SPY regressor
model <- add_regressor(model, 'SPY_CLOSE', prior.scale = 0.0000001, standardize = FALSE)
model <- fit.prophet(model, df_historic_with_SPY_close)
forecast_final <- predict(model, forecast)
plot(model, forecast_final)
Result:
Using the following dataset, with dates changed to one that is close in the SPY dataset:
df<-dput(df)
structure(list(ds = structure(c(17169, 17198, 17226, 17259, 17287,
17318, 17350, 17379, 17410, 17442, 17471, 17501, 17534, 17563,
17591, 17624, 17652, 17683, 17715, 17744, 17778, 17805, 17836,
17868, 17898, 17928, 17956, 17987, 18017, 18050, 18078, 18109,
18142, 18170, 18201, 18232), class = "Date"), y = c(266, 145.9,
183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8, 122.9, 336.5,
185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287, 226, 303.6, 289.9,
421.6, 264.5, 342.3, 339.7, 440.4, 315.9, 439.3, 401.3, 437.4,
575.5, 407.6, 682, 475.3, 581.3, 646.9)), class = "data.frame", row.names = c(NA,
-36L))
We get this, which looks much better:
The problem is to do with missing data. Some dates in the shampoo dataset are not in the SPY dataset. The following will select the data from the nearest date in the SPY dataset to overcome the problem of missing data. However, the graph it generates still looks odd, and changing the dates a little seems to be cause of the problem.
Replacing the line:
df_historic_with_SPY_close <- merge(df, Close_no_rename, by = "ds")
With (credit to https://stackoverflow.com/a/28073823/7967291):
library(data.table)
setDT(Close_no_rename)
setDT(df)
setkey(Close_no_rename, ds)[, dateMatch:=ds]
df_historic_with_SPY_close<-Close_no_rename[df, roll='nearest']
df_historic_with_SPY_close<-setDT(df_historic_with_SPY_close)[,-1]
names(df_historic_with_SPY_close)[names(df_historic_with_SPY_close) == "dateMatch"] <- "ds"
df_historic_with_SPY_close
df_historic_with_SPY_close <- mutate ( df_historic_with_SPY_close, ds = ymd(ds) )
str(df_historic_with_SPY_close)
Upvotes: 1