Swati Kanchan
Swati Kanchan

Reputation: 99

Forecasting Fortnightly data using OLS

I have bimonthly data of the customer sales and the data looks like

Sales_date  Cigarette_sales
10/15/2015   2,783 
11/1/2015    385 
11/15/2015   536 
12/1/2015    768 
12/15/2015   413 
1/1/2016     182 
1/15/2016    529 
2/1/2016     398 
2/15/2016    22 
3/1/2016     65 
3/15/2016    603 
4/1/2016     759 
4/15/2016    64 
5/1/2016     391 
5/15/2016    669 
6/1/2016     833 
6/15/2016    516 
7/1/2016     480 
7/15/2016    260 
8/1/2016     252 
8/15/2016    689 
9/1/2016     119 
9/15/2016    812 
10/1/2016    275 
10/15/2016   425 
11/1/2016    132 
11/15/2016   26 
12/1/2016    170 
12/15/2016   321 
1/1/2017     349 
1/15/2017    102 
2/1/2017     155 
2/15/2017    117 
3/1/2017     99 
3/15/2017    812 
4/1/2017     441 
4/15/2017    51 
5/1/2017     210 
5/15/2017    625 
6/1/2017     42 
6/15/2017    333 
7/1/2017     460 
7/15/2017    1,580 
8/1/2017     632 
8/15/2017    441 
9/1/2017     80 
9/15/2017    723 
10/1/2017    209 
10/15/2017   377 
11/1/2017    493 
11/15/2017   475 
12/1/2017    252 
12/15/2017   735

Since for linear regressions on time series data, we need a numeric indicator for time period so I have created a days variable for DatetimeIndex which is simply a counter from 0 on my training dataset

train['days'] = range(len(train))

10/15/2015   2,783  0
11/1/2015    385    1
11/15/2015   536    2
12/1/2015    768    3
12/15/2015   413    4
1/1/2016     182    5
1/15/2016    529    6
2/1/2016     398    7
2/15/2016    22     8
3/1/2016     65     9
3/15/2016    603    10
4/1/2016     759    11
4/15/2016    64     12
5/1/2016     391    13
5/15/2016    669    14
6/1/2016     833    15
6/15/2016    516    16
7/1/2016     480    17
7/15/2016    260    18
8/1/2016     252    19
8/15/2016    689    20
9/1/2016     119    21
9/15/2016    812    22
10/1/2016    275    23
10/15/2016   425    24
11/1/2016    132    25
11/15/2016   26     26
12/1/2016    170    27
12/15/2016   321    28
1/1/2017     349    29
1/15/2017    102    30
2/1/2017     155    31
2/15/2017    117    32
3/1/2017     99     33
3/15/2017    812    34
4/1/2017     441    35
4/15/2017    51     36
5/1/2017     210    37
5/15/2017    625    38
6/1/2017     42     39
6/15/2017    333    40
7/1/2017     460    41
7/15/2017    1,580  42
8/1/2017     632    43
8/15/2017    441    44
9/1/2017     80     45
9/15/2017    723    46
10/1/2017    209    47
10/15/2017   377    48
11/1/2017    493    49
11/15/2017   475    50
12/1/2017    252    51
12/15/2017   735    52

when I try to fit the OLS regression on the cigarette sales on days

sales_lm4 = smf.ols(cigarette_sales ~ days', data=salesdata).fit()

this is my summary

Dep. Variable: cigarette_sales R-squared: 0.001
Model: OLS Adj. R-squared: -0.019
Method: Least Squares F-statistic: 0.03564
Date: Fri, 29 Jun 2018 Prob (F-statistic): 0.851
Time: 8:51:28 Log-Likelihood: -73.31
No. Observations: 52 AIC: 150.6
Df Residuals: 50 BIC: 154.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975] Intercept 19.3901 0.284 68.187 0 18.819 19.961 days 0.0018 0.009 0.189 0.851 -0.017 0.021 Omnibus: 8.573 Durbin-Watson: 1.795
Prob(Omnibus): 0.014 Jarque-Bera (JB): 8.209
Skew: -0.959 Prob(JB): 0.0165
Kurtosis: 3.331 Cond. No. 61.8

Even when I take the log of the sales data there is not much change in the summary statistics

sales_lm4 = smf.ols(log_cigarette_sales ~ days', data=salesdata).fit()

Dep. Variable: log_cigarette_sales R-squared: 0.002

Model: OLS Adj. R-squared: -0.017

Method: Least Squares F-statistic: 0.1134

Date: Tue, 03 Jul 2018 Prob (F-statistic): 0.738

Time: 09:29:19 Log-Likelihood: -76.756

No. Observations: 53 AIC: 157.5

Df Residuals: 51 BIC: 161.5

Df Model: 1

Covariance Type: nonrobust

coef

std err

t

P>|t|

[0.025

0.975]

Intercept 19.5629 0.284 68.799 0.000 18.992 20.134

days -0.0032 0.009 -0.337 0.738 -0.022 0.016

Omnibus: 6.854 Durbin-Watson: 1.701

Prob(Omnibus): 0.032 Jarque-Bera (JB): 5.975

Skew: -0.789 Prob(JB): 0.0504

Kurtosis: 3.462 Cond. No. 59.5

how can I use more modify this series to produce a good model where I have a good r square and adjusted r square values. What is that I am doing wrong here?

Upvotes: 2

Views: 169

Answers (1)

AlainD
AlainD

Reputation: 6356

Before anything, make a plot. It is quite obvious on it that 10/15/2015 and 7/15/2017 are outliers. Remove them form the training set.

The data still fluctuates a lot. You have two ways of getting rid of the fluctuation: (a) make a moving average, (b) look at the cumulative data. For the first, if you do not average on a full a seasonal cycle, you introduce a bias (which can be removed afterwards but requires some extra computation). So for the first pass, I prefer to keep the data as raw as possible.

Taking cumulative data [y(t) = x(t)+y(t-1)] makes the fluctuation vanishes (there is not magic, the fluctuation is an order of magnitude lower than the trend). The graph here is almost linear, indicating a stable (constant) customer sale. A linear regression tells you the sloop is 369.48, in other words the sales is about 369.48 per 15 days.

You can then compute the residual e(t)=x(t)-369.48, and look for seasonality in there. I do not see any, but be sure by making an autocorrelation.

So your best forecast is 369.48.

enter image description here

Upvotes: 2

Related Questions