Reputation: 401
I am training my data skills in Python, having learned them in R. However, I have a question about a simple linear regression.
Climate_change Data: [link here]
import pandas as pd
import statsmodels.api as sm

# climate_change data from the CSV linked above (filename assumed here)
df = pd.read_csv('climate_change.csv')

train = df[df.Year >= 2006]
X = train[['MEI', 'CO2', 'CH4', 'N2O', 'CFC.11', 'CFC.12', 'TSI', 'Aerosols']]
y = train[['Temp']]

model = sm.OLS(y, X).fit()
predictions = model.predict(X)
model.summary()
OLS Regression Results
Dep. Variable: Temp R-squared: 0.972
Model: OLS Adj. R-squared: 0.964
Method: Least Squares F-statistic: 123.1
Date: Mon, 01 Oct 2018 Prob (F-statistic): 9.54e-20
Time: 14:52:53 Log-Likelihood: 46.898
No. Observations: 36 AIC: -77.80
Df Residuals: 28 BIC: -65.13
Df Model: 8
Covariance Type: nonrobust
coef
MEI 0.0361
CO2 0.0046
CH4 -0.0023
N2O -0.0141
CFC-11 -0.0312
CFC-12 0.0358
TSI -0.0033
Aerosols 69.9680
Omnibus: 8.397 Durbin-Watson: 1.484
Prob(Omnibus): 0.015 Jarque-Bera (JB): 10.511
Skew: -0.546 Prob(JB): 0.00522
Kurtosis: 5.412 Cond. No. 6.35e+06
# climate_change read from the same CSV (filename assumed here)
climate_change <- read.csv("climate_change.csv")

train <- climate_change[climate_change$Year >= 2006, ]
prev <- lm(Temp ~ ., data = train[, 3:NCOL(train)])
summary(prev)
Residuals:
Min 1Q Median 3Q Max
-0.221684 -0.032846 0.002042 0.037158 0.167887

Coefficients:
Estimate
MEI 0.036056
CO2 0.004817
CH4 -0.002366
N2O -0.013007
CFC-11 -0.033194
CFC-12 0.037775
TSI 0.009100
Aerosols 70.463329

Residual standard error: 0.07594 on 27 degrees of freedom
Multiple R-squared: 0.5346, Adjusted R-squared: 0.3967
F-statistic: 3.877 on 8 and 27 DF, p-value: 0.003721
There is a big difference between the R-squared values, and the coefficients of the independent variables also differ a bit. Could someone explain why?
Upvotes: 4
Views: 3126
Reputation: 11514
Just to point this out: statsmodels' least squares fit does not include a constant by default. If we remove the constant from R's fit, we get results very similar to the Python implementation; or, the other way around, if we add a constant to the statsmodels fit, we get results similar to R's.

Remove the constant in R's lm call:
summary(lm(Temp ~ . - 1, data = train[,3:NCOL(train)]))
Call:
lm(formula = Temp ~ . - 1, data = train[, 3:NCOL(train)])
Residuals:
Min 1Q Median 3Q Max
-0.221940 -0.032347 0.002071 0.037048 0.167294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
MEI 0.036076 0.027983 1.289 0.2079
CO2 0.004640 0.008945 0.519 0.6080
CH4 -0.002328 0.002132 -1.092 0.2843
N2O -0.014115 0.079452 -0.178 0.8603
`CFC-11` -0.031232 0.096693 -0.323 0.7491
`CFC-12` 0.035760 0.103574 0.345 0.7325
TSI -0.003283 0.036861 -0.089 0.9297
Aerosols 69.968040 33.093275 2.114 0.0435 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.07457 on 28 degrees of freedom
Multiple R-squared: 0.9724, Adjusted R-squared: 0.9645
F-statistic: 123.1 on 8 and 28 DF, p-value: < 2.2e-16
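This also explains the jump in R-squared: without an intercept, both R and statsmodels compute R-squared against zero rather than against the mean of Temp. A minimal sketch of the two conventions, reusing the no-constant model from the question:

# Without a constant the total sum of squares is taken around zero
# (uncentered); with a constant it is taken around the mean (centered).
resid = y.values.ravel() - model.predict(X)
ss_res = (resid ** 2).sum()
tss_uncentered = (y.values.ravel() ** 2).sum()
tss_centered = ((y.values.ravel() - y.values.ravel().mean()) ** 2).sum()
print(1 - ss_res / tss_uncentered)  # ~0.97, matching the no-constant fits
print(1 - ss_res / tss_centered)    # far lower, the centered convention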
Let's add a constant to the statsmodels call:
X_with_constant = sm.add_constant(X)  # prepend an intercept column
model = sm.OLS(y, X_with_constant).fit()
model.summary()
gives us identical results:
OLS Regression Results
Dep. Variable: Temp R-squared: 0.535
Model: OLS Adj. R-squared: 0.397
Method: Least Squares F-statistic: 3.877
Date: Tue, 02 Oct 2018 Prob (F-statistic): 0.00372
Time: 10:14:03 Log-Likelihood: 46.899
No. Observations: 36 AIC: -75.80
Df Residuals: 27 BIC: -61.55
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -17.8663 563.008 -0.032 0.975 -1173.064 1137.332
MEI 0.0361 0.029 1.265 0.217 -0.022 0.095
CO2 0.0048 0.011 0.451 0.656 -0.017 0.027
CH4 -0.0024 0.002 -0.950 0.351 -0.007 0.003
N2O -0.0130 0.088 -0.148 0.884 -0.194 0.168
CFC-11 -0.0332 0.116 -0.285 0.777 -0.272 0.205
CFC-12 0.0378 0.123 0.307 0.761 -0.215 0.290
TSI 0.0091 0.392 0.023 0.982 -0.795 0.813
Aerosols 70.4633 37.139 1.897 0.069 -5.739 146.666
Omnibus: 8.316 Durbin-Watson: 1.488
Prob(Omnibus): 0.016 Jarque-Bera (JB): 10.432
Skew: -0.535 Prob(JB): 0.00543
Kurtosis: 5.410 Cond. No. 1.06e+08
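As an aside, statsmodels also has a formula interface that, like R, adds the intercept automatically. A minimal sketch, assuming the column names used above (patsy's Q() quotes names that are not valid Python identifiers):

import statsmodels.formula.api as smf

# The formula API adds an intercept by default, mirroring R's lm()
formula = 'Temp ~ MEI + CO2 + CH4 + N2O + Q("CFC.11") + Q("CFC.12") + TSI + Aerosols'
smf.ols(formula, data=train).fit().summary()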
Upvotes: 3
Reputation: 887118
As mentioned in the comments, it could be an issue with multicollinearity, based on the warnings given. One way to test whether we get the same R-squared is to use another package, sklearn, and build the model with its LinearRegression module:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

regr = LinearRegression()  # fits an intercept by default, like R's lm()
regr.fit(X, y)
predictions = regr.predict(X)

r2_score(y, predictions)
#0.5345800653144226
But LinearRegression won't give any summary output; you have to extract the parameters of interest yourself.
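For example, a minimal sketch pulling the fitted parameters out of the model above:

# intercept_ and coef_ hold the fitted parameters; coef_ is 2-D here
# because y was passed as a one-column DataFrame, so flatten it first
print(regr.intercept_)
print(dict(zip(X.columns, regr.coef_.ravel())))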
Upvotes: 1