Brian B
Brian B

Reputation: 1450

Getting correct exogenous least squares prediction in Python statsmodels

I am having trouble getting a reasonable prediction behavior from least squares fits in statsmodels version 0.6.1. It does not seem to be providing a sensible value.

Consider the following data

import numpy as np

xx = np.array([1.1,2.2,3.3,4.4])  # Independent variable
XX = sm.add_constant(xx)  # Include constant for matrix fitting in statsmodels
yy = np.array([2,1,5,6])  # Dependent variable
ww = np.array([0.1,1,3,0.5])  # Weights to try
wn = ww/ww.sum()  # Normalized weights
zz = 1.9  # Independent variable value to predict for

We can use numpy to do a weighted fit and prediction

np_unw_value = np.polyval(np.polyfit(xx, yy, deg=1, w=1+0*ww), zz)
print("Unweighted fit prediction from numpy.polyval is {sp}".format(sp=np_unw_value))

and we find a prediction of 2.263636.

As a sanity check, we can also see what R has to say about the matter

import pandas as pd
import rpy2.robjects
from rpy2.robjects.packages import importr
import rpy2.robjects.pandas2ri

rpy2.robjects.pandas2ri.activate()
pdf = pd.DataFrame({'x':xx, 'y':yy, 'w':wn})
pdz = pd.DataFrame({'x':[zz], 'y':[np.Inf]})
rfit = rpy2.robjects.r.lm('y~x', data=pdf, weights=1+0*pdf['w']**2)
rpred = rpy2.robjects.r.predict(rfit, pdz)[0]
print("Unweighted fit prediction from R is {sp}".format(sp=rpred))

and again we find a prediction of 2.263636. My problem is that we do not get that result from statmodels OLS

import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

owls = sm.OLS(yy, XX).fit()
sm_value_u, iv_lu, iv_uu = wls_prediction_std(owls, exog=np.array([[1,zz]]))
sm_unw_v = sm_value_u[0]
print("Unweighted OLS fit prediction from statsmodels.wls_prediction_std is {sp}".format(sp=sm_unw_v))

Instead I obtain a value 1.695814 (similar things happen with WLS()). Either there is a bug, or using statsmodels for prediction has some trick too obscure for me to find. What is going on?

Upvotes: 0

Views: 731

Answers (1)

Josef
Josef

Reputation: 22897

The results classes have a predict method that provides the prediction for new values of the explanatory variables:

>>> print(owls.predict(np.array([[1,zz]])))
[ 2.26363636]

The first return of wls_prediction_std is the standard error for the prediction not the prediction itself.

>>> help(wls_prediction_std)
Help on function wls_prediction_std in module statsmodels.sandbox.regression.predstd:

wls_prediction_std(res, exog=None, weights=None, alpha=0.05)
    calculate standard deviation and confidence interval for prediction

    applies to WLS and OLS, not to general GLS,
    that is independently but not identically distributed observations

    Parameters
    ----------
    res : regression result instance
        results of WLS or OLS regression required attributes see notes
    exog : array_like (optional)
        exogenous variables for points to predict
    weights : scalar or array_like (optional)
        weights as defined for WLS (inverse of variance of observation)
    alpha : float (default: alpha = 0.05)
        confidence level for two-sided hypothesis

    Returns
    -------
    predstd : array_like, 1d
        standard error of prediction
        same length as rows of exog
    interval_l, interval_u : array_like
        lower und upper confidence bounds

The sandbox function will be replaced by a new method get_prediction of the results classes that provides the prediction and the extra results like standard deviation and confidence and prediction intervals.

http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.get_prediction.html

Upvotes: 2

Related Questions