Reputation: 61
I use Python to fit a linear regression model; the JSON data is as below:
{"Y":[1,2,3,4,5],"X":[[1,43,23],[2,3,43],[3,23,334],[4,43,23],[232,234,24]]}
I use statsmodels.api.sm.OLS(...).fit() and statsmodels.formula.api.ols(...).fit(). I think they are the same model, but the results are different.
here is the first function:
import json
import numpy as np
import statsmodels.api as sm

def analyze1():
    print('using sm.OLS().fit')
    data = json.load(open(FNAME_DATA))  # FNAME_DATA: path to the JSON file above
    X = np.asarray(data['X'])
    Y = np.log(np.asarray(data['Y']) + 1)
    X2 = sm.add_constant(X)
    results = sm.OLS(Y, X2).fit()
    print(results.summary())
here is the second function:
import json
from statsmodels.formula.api import ols

def analyze2():
    print('using ols().fit')
    data = json.load(open(FNAME_DATA))
    results = ols('Y ~ X + 1', data=data).fit()
    print(results.summary())
the first function outputs:
using sm.OLS().fit
/home/aaron/anaconda2/lib/python2.7/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 5 samples were given.
"samples were given." % int(n), ValueWarning)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.449
Model: OLS Adj. R-squared: -1.204
Method: Least Squares F-statistic: 0.2717
Date: Wed, 07 Aug 2019 Prob (F-statistic): 0.849
Time: 07:17:00 Log-Likelihood: -0.87006
No. Observations: 5 AIC: 9.740
Df Residuals: 1 BIC: 8.178
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.0859 0.720 1.509 0.373 -8.057 10.228
x1 0.0024 0.018 0.134 0.915 -0.229 0.234
x2 0.0005 0.020 0.027 0.983 -0.256 0.257
x3 0.0008 0.003 0.332 0.796 -0.031 0.033
==============================================================================
Omnibus: nan Durbin-Watson: 1.485
Prob(Omnibus): nan Jarque-Bera (JB): 0.077
Skew: 0.175 Prob(JB): 0.962
Kurtosis: 2.503 Cond. No. 402.
==============================================================================
the second function outputs:
using ols().fit
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 0.551
Model: OLS Adj. R-squared: -0.796
Method: Least Squares F-statistic: 0.4092
Date: Wed, 07 Aug 2019 Prob (F-statistic): 0.784
Time: 07:17:00 Log-Likelihood: -6.8251
No. Observations: 5 AIC: 21.65
Df Residuals: 1 BIC: 20.09
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.9591 2.368 0.827 0.560 -28.124 32.042
X[0] 0.0030 0.060 0.051 0.968 -0.757 0.764
X[1] 0.0098 0.066 0.148 0.906 -0.834 0.854
X[2] 0.0024 0.008 0.289 0.821 -0.103 0.108
==============================================================================
Omnibus: nan Durbin-Watson: 1.485
Prob(Omnibus): nan Jarque-Bera (JB): 0.077
Skew: 0.175 Prob(JB): 0.962
Kurtosis: 2.503 Cond. No. 402.
==============================================================================
I think these are the same model, but with the same data the results (coefficients) and log-likelihoods are different. I don't know whether these two models differ in some way.
Upvotes: 5
Views: 6558
Reputation: 2188
The former (OLS) is a class. The latter (ols) is a method of the OLS class that is inherited from statsmodels.base.model.Model.
In [11]: from statsmodels.api import OLS
In [12]: from statsmodels.formula.api import ols
In [13]: OLS
Out[13]: statsmodels.regression.linear_model.OLS
In [14]: ols
Out[14]: <bound method Model.from_formula of <class 'statsmodels.regression.linear_model.OLS'>>
Based on my own testing, I believe the two models should produce the same result. However, in your example you are applying a log transform to Y in the first model (np.log(np.asarray(data['Y']) + 1)) but not in the second (Y ~ X + 1 uses the raw Y). The fields that agree between your two summaries are computed solely from X, which is the same in both models; the fields that disagree do so because of the difference in Y.
Since I do not have access to your data, feel free to use this standalone example as a sanity check. These two models (which seem to be rubbish) produced the same summary after I fitted them.
Example:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes
from statsmodels.formula.api import ols

X = pd.DataFrame(data=load_diabetes()['data'],
                 columns=load_diabetes()['feature_names'])
X.drop(['age', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], axis=1, inplace=True)
X = sm.add_constant(X)
y = pd.DataFrame(data=load_diabetes()['target'], columns=['y'])

mod1 = sm.OLS(np.log(y), X)
results1 = mod1.fit()
print(results1.summary())

mod2 = ols('np.log(y) ~ sex + bmi', data=pd.concat([X, y], axis=1))
results2 = mod2.fit()
print(results2.summary())
Output (OLS):
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.297
Model: OLS Adj. R-squared: 0.294
Method: Least Squares F-statistic: 92.90
Date: Tue, 06 Aug 2019 Prob (F-statistic): 2.27e-34
Time: 21:06:21 Log-Likelihood: -291.29
No. Observations: 442 AIC: 588.6
Df Residuals: 439 BIC: 600.9
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.8813 0.022 218.671 0.000 4.837 4.925
sex -0.0868 0.471 -0.184 0.854 -1.013 0.839
bmi 6.4042 0.471 13.593 0.000 5.478 7.330
==============================================================================
Omnibus: 14.733 Durbin-Watson: 1.892
Prob(Omnibus): 0.001 Jarque-Bera (JB): 15.547
Skew: -0.446 Prob(JB): 0.000421
Kurtosis: 2.776 Cond. No. 22.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Output (ols):
OLS Regression Results
==============================================================================
Dep. Variable: np.log(y) R-squared: 0.297
Model: OLS Adj. R-squared: 0.294
Method: Least Squares F-statistic: 92.90
Date: Wed, 27 May 2020 Prob (F-statistic): 2.27e-34
Time: 01:42:40 Log-Likelihood: -291.29
No. Observations: 442 AIC: 588.6
Df Residuals: 439 BIC: 600.9
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 4.8813 0.022 218.671 0.000 4.837 4.925
sex -0.0868 0.471 -0.184 0.854 -1.013 0.839
bmi 6.4042 0.471 13.593 0.000 5.478 7.330
==============================================================================
Omnibus: 14.733 Durbin-Watson: 1.892
Prob(Omnibus): 0.001 Jarque-Bera (JB): 15.547
Skew: -0.446 Prob(JB): 0.000421
Kurtosis: 2.776 Cond. No. 22.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Upvotes: 3