Reputation: 1450
I am having trouble getting a reasonable prediction behavior from least squares fits in statsmodels
version 0.6.1. It does not seem to be providing a sensible value.
Consider the following data
import numpy as np
xx = np.array([1.1,2.2,3.3,4.4]) # Independent variable
XX = sm.add_constant(xx) # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6]) # Dependent variable
ww = np.array([0.1,1,3,0.5]) # Weights to try
wn = ww/ww.sum() # Normalized weights
zz = 1.9 # Independent variable value to predict for
We can use numpy
to do a weighted fit and prediction
np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))
and we find a prediction of 2.263636.
As a sanity check, we can also see what R has to say about the matter
import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri
rpy2.robjects.pandas2ri.activate()
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.Inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))
and again we find a prediction of 2.263636. My problem is that we do not get that result from statmodels OLS
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))
Instead I obtain a value 1.695814 (similar things happen with WLS()
). Either there is a bug, or using statsmodels
for prediction has some trick too obscure for me to find. What is going on?
Upvotes: 0
Views: 731
Reputation: 22897
The results classes have a predict
method that provides the prediction for new values of the explanatory variables:
>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]
The first return of wls_prediction_std
is the standard error for the prediction not the prediction itself.
>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:
wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
calculate standard deviation and confidence interval for prediction
applies to WLS and OLS, not to general GLS,
that is independently but not identically distributed observations
Parameters
----------
res : regression result instance
results of WLS or OLS regression required attributes see notes
exog : array_like (optional)
exogenous variables for points to predict
weights : scalar or array_like (optional)
weights as defined for WLS (inverse of variance of observation)
alpha : float (default: alpha = 0.05)
confidence level for two-sided hypothesis
Returns
-------
predstd : array_like, 1d
standard error of prediction
same length as rows of exog
interval_l, interval_u : array_like
lower und upper confidence bounds
The sandbox function will be replaced by a new method get_prediction
of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.
Upvotes: 2