Reputation: 99
Sales_date Cigarette_sales
10/15/2015 2,783
11/1/2015 385
11/15/2015 536
12/1/2015 768
12/15/2015 413
1/1/2016 182
1/15/2016 529
2/1/2016 398
2/15/2016 22
3/1/2016 65
3/15/2016 603
4/1/2016 759
4/15/2016 64
5/1/2016 391
5/15/2016 669
6/1/2016 833
6/15/2016 516
7/1/2016 480
7/15/2016 260
8/1/2016 252
8/15/2016 689
9/1/2016 119
9/15/2016 812
10/1/2016 275
10/15/2016 425
11/1/2016 132
11/15/2016 26
12/1/2016 170
12/15/2016 321
1/1/2017 349
1/15/2017 102
2/1/2017 155
2/15/2017 117
3/1/2017 99
3/15/2017 812
4/1/2017 441
4/15/2017 51
5/1/2017 210
5/15/2017 625
6/1/2017 42
6/15/2017 333
7/1/2017 460
7/15/2017 1,580
8/1/2017 632
8/15/2017 441
9/1/2017 80
9/15/2017 723
10/1/2017 209
10/15/2017 377
11/1/2017 493
11/15/2017 475
12/1/2017 252
12/15/2017 735
Since for linear regressions on time series data, we need a numeric indicator for time period so I have created a days variable for DatetimeIndex which is simply a counter from 0 on my training dataset
train['days'] = range(len(train))
10/15/2015 2,783 0
11/1/2015 385 1
11/15/2015 536 2
12/1/2015 768 3
12/15/2015 413 4
1/1/2016 182 5
1/15/2016 529 6
2/1/2016 398 7
2/15/2016 22 8
3/1/2016 65 9
3/15/2016 603 10
4/1/2016 759 11
4/15/2016 64 12
5/1/2016 391 13
5/15/2016 669 14
6/1/2016 833 15
6/15/2016 516 16
7/1/2016 480 17
7/15/2016 260 18
8/1/2016 252 19
8/15/2016 689 20
9/1/2016 119 21
9/15/2016 812 22
10/1/2016 275 23
10/15/2016 425 24
11/1/2016 132 25
11/15/2016 26 26
12/1/2016 170 27
12/15/2016 321 28
1/1/2017 349 29
1/15/2017 102 30
2/1/2017 155 31
2/15/2017 117 32
3/1/2017 99 33
3/15/2017 812 34
4/1/2017 441 35
4/15/2017 51 36
5/1/2017 210 37
5/15/2017 625 38
6/1/2017 42 39
6/15/2017 333 40
7/1/2017 460 41
7/15/2017 1,580 42
8/1/2017 632 43
8/15/2017 441 44
9/1/2017 80 45
9/15/2017 723 46
10/1/2017 209 47
10/15/2017 377 48
11/1/2017 493 49
11/15/2017 475 50
12/1/2017 252 51
12/15/2017 735 52
when I try to fit the OLS regression on the cigarette sales on days
sales_lm4 = smf.ols(cigarette_sales ~ days', data=salesdata).fit()
this is my summary
Dep. Variable: cigarette_sales R-squared: 0.001
Model: OLS Adj. R-squared: -0.019
Method: Least Squares F-statistic: 0.03564
Date: Fri, 29 Jun 2018 Prob (F-statistic): 0.851
Time: 8:51:28 Log-Likelihood: -73.31
No. Observations: 52 AIC: 150.6
Df Residuals: 50 BIC: 154.5
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 19.3901 0.284 68.187 0 18.819 19.961
days 0.0018 0.009 0.189 0.851 -0.017 0.021
Omnibus: 8.573 Durbin-Watson: 1.795
Prob(Omnibus): 0.014 Jarque-Bera (JB): 8.209
Skew: -0.959 Prob(JB): 0.0165
Kurtosis: 3.331 Cond. No. 61.8
sales_lm4 = smf.ols(log_cigarette_sales ~ days', data=salesdata).fit()
Dep. Variable: log_cigarette_sales R-squared: 0.002
Model: OLS Adj. R-squared: -0.017
Method: Least Squares F-statistic: 0.1134
Date: Tue, 03 Jul 2018 Prob (F-statistic): 0.738
Time: 09:29:19 Log-Likelihood: -76.756
No. Observations: 53 AIC: 157.5
Df Residuals: 51 BIC: 161.5
Df Model: 1
Covariance Type: nonrobust
coef
std err
t
P>|t|
[0.025
0.975]
Intercept 19.5629 0.284 68.799 0.000 18.992 20.134
days -0.0032 0.009 -0.337 0.738 -0.022 0.016
Omnibus: 6.854 Durbin-Watson: 1.701
Prob(Omnibus): 0.032 Jarque-Bera (JB): 5.975
Skew: -0.789 Prob(JB): 0.0504
Kurtosis: 3.462 Cond. No. 59.5
Upvotes: 2
Views: 169
Reputation: 6356
Before anything, make a plot. It is quite obvious on it that 10/15/2015 and 7/15/2017 are outliers. Remove them form the training set.
The data still fluctuates a lot. You have two ways of getting rid of the fluctuation: (a) make a moving average, (b) look at the cumulative data. For the first, if you do not average on a full a seasonal cycle, you introduce a bias (which can be removed afterwards but requires some extra computation). So for the first pass, I prefer to keep the data as raw as possible.
Taking cumulative data [y(t) = x(t)+y(t-1)
] makes the fluctuation vanishes (there is not magic, the fluctuation is an order of magnitude lower than the trend). The graph here is almost linear, indicating a stable (constant) customer sale. A linear regression tells you the sloop is 369.48, in other words the sales is about 369.48 per 15 days.
You can then compute the residual e(t)=x(t)-369.48
, and look for seasonality in there. I do not see any, but be sure by making an autocorrelation.
So your best forecast is 369.48.
Upvotes: 2