user2867237

Reputation: 457

Regression with Date variable (python)

I have a time series (daily) dataset consisting of 1 label (integer) and 15 features over 5 years. I have no idea about the meaning of features, but I have to predict the labels based on those features.

To do so, first, I used the autocorrelation_plot from pandas.tools.plotting to figure out if I have any seasonality in my label (y) or not. Please see the figure below:

[Figure: autocorrelation plot of the label y (image not available)]

Then I used seasonal_decompose to extract the seasonal, trend and residual components of my label (y), sweeping the Freq parameter:

[Figures: six seasonal_decompose plots (seasonal, trend and residual components) for different Freq values (images not available)]

Upvotes: 1

Views: 1134

Answers (1)

Stéphane

Reputation: 207

Let me explain to you how seasonality is usually treated.

Most of the time, people extract a seasonal component and work with the seasonally adjusted series. In North America, statistical agencies apply a sequence of symmetric moving average filters to estimate seasonal, trend-cycle and irregular components; seasonally adjusted data is the raw data minus the estimated seasonal component. Usually they also publish the raw data in other tables and, sometimes, the trend-cycle in yet other tables. In Australia, they prefer to present trend-cycles.

In Europe, decomposition is usually model-based: they specify an ARIMA model with seasonal components (which allows for integrated seasonal components, moving average components in the seasonal dynamics, etc.) and obtain the decomposition by imposing hypotheses on the model to extract specific frequencies.

Now, the first thing you need to know is what exactly your function does. If it uses moving average filters, be aware that those filters are symmetric, which forces the use of backcasts and forecasts: you need points before the beginning and after the end of the sample to apply a symmetric filter (it's the same end-point problem faced by filters like the Hodrick-Prescott filter, for instance). So the procedure needs either a good ARIMA model with seasonality to extend the series, so that the end points do not behave too poorly, or asymmetric filters at the end points. The symmetry also implies a small data-snooping bias if you use the corrected dataset to compare forecasting models, because every corrected point contains future information. If you use an ARIMA model, the filter is asymmetric and corrected data points are not built using future points.
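To make the end-point and data-snooping issue concrete, here is a minimal sketch with NumPy. The helper names `centered_ma` and `trailing_ma` are made up for illustration, and the trailing average is only a stand-in for "a one-sided filter", not the ARIMA-based filter an agency would actually use.

```python
import numpy as np

def centered_ma(y, window):
    """Symmetric moving average (odd window). Undefined (NaN) at both ends."""
    half = window // 2
    out = np.full(len(y), np.nan)
    for t in range(half, len(y) - half):
        out[t] = y[t - half : t + half + 1].mean()
    return out

def trailing_ma(y, window):
    """One-sided moving average: uses only the current and past points."""
    out = np.full(len(y), np.nan)
    for t in range(window - 1, len(y)):
        out[t] = y[t - window + 1 : t + 1].mean()
    return out

y = np.arange(10, dtype=float)
sym = centered_ma(y, 7)   # NaN for the first 3 and the last 3 points
one = trailing_ma(y, 7)   # NaN only for the first 6 points

# Change a *future* observation and look at the filtered value at t = 6.
y_shocked = y.copy()
y_shocked[9] = 100.0
sym_shocked = centered_ma(y_shocked, 7)
one_shocked = trailing_ma(y_shocked, 7)

print("symmetric filter at t=6:", sym[6], "->", sym_shocked[6])  # changes
print("one-sided filter at t=6:", one[6], "->", one_shocked[6])  # unchanged
```

The symmetric filter has no value at the last three dates unless the series is extended with forecasts, and its value at t = 6 moves when the observation at t = 9 changes. That is the future information leaking into the corrected series.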

Now, to forecast, you have two options: (1) forecast the corrected series (and then forecast the seasonal component separately if you absolutely need raw values); or (2) forecast the raw series directly.

It's not obvious which is the best way to proceed. In theory you want (2), but it can get very complicated (think frontier research models) unless you use an ARIMA with a seasonal component, or impose constant seasonality and use seasonal dummies.
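As a rough sketch of the "constant seasonality with seasonal dummies" route, the raw series can be regressed on an intercept, a linear trend and day-of-week dummies. Everything here is an assumption made for illustration: the data is synthetic, the period of 7 is picked by hand, and `design_matrix` is a hypothetical helper. With real data, your 15 features could be appended as extra columns of the design matrix.

```python
import numpy as np

def design_matrix(t, period):
    """Columns: intercept, linear trend, period-1 seasonal dummies (phase 0 is the baseline)."""
    t = np.asarray(t)
    phase = t % period
    dummies = (phase[:, None] == np.arange(1, period)[None, :]).astype(float)
    return np.column_stack([np.ones(len(t)), t.astype(float), dummies])

# Synthetic daily series: 5 years, weekly pattern, linear trend, Gaussian noise.
rng = np.random.default_rng(0)
period = 7
n = 5 * 365
t = np.arange(n)
true_effects = np.array([0.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.0])
y = 10.0 + 0.01 * t + true_effects[t % period] + rng.normal(scale=0.5, size=n)

# Ordinary least squares fit on the raw (unadjusted) series.
beta, *_ = np.linalg.lstsq(design_matrix(t, period), y, rcond=None)

# Forecast the raw series two weeks ahead; no future observations are needed.
t_future = np.arange(n, n + 14)
forecast = design_matrix(t_future, period) @ beta

print("intercept, slope:", beta[:2])
print("seasonal effects relative to phase 0:", beta[2:])
```

Because the seasonal pattern is part of the model, there is no separate adjustment step and therefore no end-point problem. The price is the assumption that the seasonal effects stay constant over the whole sample.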

As for the 'frequency' choice, I tend to use informal tests to determine what is appropriate. In the moving average literature, we pick how long or short we want our filters, and the goal is to produce estimated seasonals that entirely capture the seasonal regularities. You can use nonparametric tests on the corrected data, like the Kruskal-Wallis test, but it is rather forgiving.
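One way to run such a check, sketched with `scipy.stats.kruskal`: group the series by its position in the cycle and test whether the groups share the same distribution. The helper `seasonality_pvalue` and the synthetic data are assumptions made for illustration, and the series is assumed to be detrended already.

```python
import numpy as np
from scipy.stats import kruskal

def seasonality_pvalue(series, period):
    """Kruskal-Wallis p-value for 'all seasonal phases have the same distribution'."""
    series = np.asarray(series, dtype=float)
    phase = np.arange(len(series)) % period
    groups = [series[phase == k] for k in range(period)]
    return kruskal(*groups).pvalue

rng = np.random.default_rng(1)
n, period = 700, 7
noise = rng.normal(size=n)
seasonal = np.tile([0.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.0], n // period)

p_raw = seasonality_pvalue(noise + seasonal, period)  # seasonality still present
p_adjusted = seasonality_pvalue(noise, period)        # stand-in for a well-adjusted series

print("p-value, seasonal series:", p_raw)
print("p-value, adjusted series:", p_adjusted)
```

A tiny p-value on the corrected data suggests residual seasonality at that period. A large one only means the test found nothing, which is weak evidence given how forgiving the test is.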

My advice, which I believe is preferable for forecasting, would be to find a package that allows you to work with parametric models with seasonality. Then, you'd have clear tests and information criteria to use to make decisions on sound statistical ground.
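To illustrate what deciding with an information criterion can look like, here is a small sketch that compares candidate seasonal periods by AIC. It swaps in a simple trend-plus-seasonal-dummies regression rather than a seasonal ARIMA, uses synthetic data, and `aic_for_period` is a hypothetical helper. With a package such as statsmodels you would fit `SARIMAX` models and compare their `aic` attribute in the same way.

```python
import numpy as np

def aic_for_period(y, period):
    """Gaussian AIC (up to an additive constant) of an OLS fit with trend and seasonal dummies."""
    n = len(y)
    t = np.arange(n)
    cols = [np.ones(n), t.astype(float)]
    cols += [(t % period == k).astype(float) for k in range(1, period)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

# Synthetic daily series with a weekly pattern (assumed for illustration).
rng = np.random.default_rng(2)
n = 5 * 365
t = np.arange(n)
weekly = np.array([0.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.0])
y = 10.0 + 0.01 * t + weekly[t % 7] + rng.normal(scale=0.5, size=n)

candidates = [1, 5, 7, 12, 30]  # period 1 means "no seasonality"
scores = {p: aic_for_period(y, p) for p in candidates}
best = min(scores, key=scores.get)

for p in candidates:
    print(f"period {p:>2}: AIC = {scores[p]:.1f}")
print("lowest AIC at period", best)
```

The point is that the choice of period becomes a model comparison with an explicit penalty for extra parameters, instead of a visual judgement on decomposition plots.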

Upvotes: 1
