Reputation: 4914
There seem to be two methods for OLS fits in Python: the scikit-learn one and the statsmodels one. I prefer the statsmodels one because it gives the errors on the coefficients via the summary() function. However, I would like to use the TransformedTargetRegressor from scikit-learn to log-transform my target. It seems I have to choose between getting the errors on my fitted coefficients in statsmodels and being able to transform my target in scikit-learn. Is there a good way to do both of these at the same time in either system?
In statsmodels it would be done like this:
import statsmodels.api as sm
X = sm.add_constant(X)  # add an intercept column
ols = sm.OLS(y, X)
ols_result = ols.fit()
print(ols_result.summary())
This returns the fit summary with the coefficients and their standard errors.
For scikit-learn you can use the TransformedTargetRegressor:
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

regr = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print('Coefficients: \n', regr.regressor_.coef_)  # coefficients of the underlying regressor
But there is no way to get the errors on the coefficients without calculating them yourself. Is there a good way to get the best of both worlds?
EDIT
I found a good example for the special case I care about here.
Upvotes: 2
Views: 1734
Reputation: 11514
Just to add a lengthy comment here: I believe that TransformedTargetRegressor does not do what you think it does. As far as I can tell, the inverse transformation function is only applied when the predict method is called. It does not express the coefficients in units of the untransformed outcome.
import pandas as pd
import statsmodels.api as sm
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X = df.loc[:,['sepal length (cm)', 'sepal width (cm)']]
y = df.loc[:, 'petal width (cm)']
regr = TransformedTargetRegressor(regressor=LinearRegression(),func=np.log1p, inverse_func=np.expm1)
regr.fit(X, y)
print(regr.regressor_.intercept_)
for coef in regr.regressor_.coef_:
    print(coef)
#-0.45867804195769357
# 0.3567583897503805
# -0.2962942997303887
X = sm.add_constant(X)
ols_trans = sm.OLS(np.log1p(y), X).fit()  # OLS on the transformed target
print(ols_trans.params)
#const -0.458678
#sepal length (cm) 0.356758
#sepal width (cm) -0.296294
#dtype: float64
You see that in both cases the coefficients are identical. That is, the regression with TransformedTargetRegressor yields the same coefficients as statsmodels.OLS with the transformed outcome. TransformedTargetRegressor does not backtranslate the coefficients into the original untransformed space. Note that the coefficients would be non-linear in the original space unless the transformation itself is linear, in which case this is trivial (adding and multiplying with constants). This discussion here points in a similar direction - backtransforming betas is infeasible in most/many cases.
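To see concretely that the inverse transform is only applied at prediction time, here is a minimal check reusing regr and X from the snippets above (X_sk is just X without the constant column that was added for statsmodels):
# predict() applies expm1 to the inner regressor's log1p-space prediction;
# drop the 'const' column again because the sklearn model was fitted without it
X_sk = X.drop(columns='const')
inner_pred = regr.regressor_.predict(X_sk)
print(np.allclose(regr.predict(X_sk), np.expm1(inner_pred)))  # True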
If interpretation is your goal, I believe the closest you can get to what you wish to achieve is to use predicted values where you vary the regressors or the coefficients. So, let me give you an example: if your goal is to say what the effect of a one-standard-error-higher coefficient for sepal length is on the untransformed outcome, you can create the predicted values as fitted as well as the predicted values for the 1-sigma scenario (either by tampering with the coefficient, as in the toy example below, or by tampering with the corresponding column in X, as sketched after it).
# Toy example to add one sigma to sepal length coefficient
coeffs = ols_trans.params.copy()
coeffs['sepal length (cm)'] += 0.018 # this is one sigma
# function to predict and translate predictions back:
def get_predicted_backtransformed(coeffs, data, inv_func):
    return inv_func(data.dot(coeffs))
# get standard predicted values, backtransformed:
original = get_predicted_backtransformed(ols_trans.params, X, np.expm1)
# get counterfactual predicted values, backtransformed:
variant1 = get_predicted_backtransformed(coeffs, X, np.expm1)
Then you can report, for example, the mean shift in the untransformed outcome:
variant1.mean()-original.mean()
#0.2523083548367202
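The second route, tampering with the column in X instead of the coefficient, works the same way. A minimal sketch, reusing X, ols_trans, original and get_predicted_backtransformed from above, and shifting sepal length by one standard deviation of that column (my choice of shift, for illustration):
# Toy example: shift the regressor itself instead of its coefficient
X_shift = X.copy()
X_shift['sepal length (cm)'] += X['sepal length (cm)'].std()  # one sd of sepal length
variant2 = get_predicted_backtransformed(ols_trans.params, X_shift, np.expm1)
print(variant2.mean() - original.mean())  # mean shift in the untransformed outcome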
Upvotes: 1
Reputation: 197
In short, scikit-learn cannot help you in calculating coefficient standard errors. However, if you opt to use it, you can just calculate the errors yourself. In the question Python scikit learn Linear Model Parameter Standard Error, @grisaitis provided a great answer explaining the main concepts behind it.
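For reference, the standard errors come from the usual OLS covariance formula Var(beta_hat) = sigma^2 * (X'X)^(-1), where sigma^2 is estimated from the residuals and X is assumed to contain the intercept column; the standard errors are the square roots of the diagonal of that matrix. The function below implements exactly this.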
If you only want to use a plug-and-play function that will work with scikit-learn, you can use this:
import numpy as np

def get_coef_std_errors(reg: 'sklearn.linear_model.LinearRegression',
                        y_true: 'np.ndarray', X: 'np.ndarray'):
    """Calculate the standard errors of the coefficients of a linear regression.

    Parameters
    ----------
    reg : sklearn.linear_model.LinearRegression
        LinearRegression object which has been fitted
    y_true : np.ndarray
        array containing the target variable
    X : np.ndarray
        array containing the features used in the regression

    Returns
    -------
    beta_std
        Standard errors of the regression coefficients
    """
    y_pred = reg.predict(X)       # get predictions
    errors = y_true - y_pred      # calculate residuals
    sigma_sq_hat = np.var(errors)  # estimate the residual variance
    sigma_beta_hat = sigma_sq_hat * np.linalg.inv(X.T @ X)  # coefficient covariance matrix
    return np.sqrt(np.diagonal(sigma_beta_hat))  # diagonal holds the coefficient variances
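A minimal usage sketch (my own toy data, not from the question): because the formula above uses X.T @ X directly, the intercept has to be an explicit column of X and the regressor must be fitted without its own intercept:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(100, 2))                                # two toy features
y = 1.0 + 2.0 * X_feat[:, 0] - 0.5 * X_feat[:, 1] + rng.normal(scale=0.1, size=100)

X_design = np.column_stack([np.ones(len(X_feat)), X_feat])        # prepend intercept column
reg = LinearRegression(fit_intercept=False).fit(X_design, y)
print(get_coef_std_errors(reg, y, X_design))                      # std errors for [const, x1, x2]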
Upvotes: 0