Reputation: 742
I have been trying to make a cubic regression for a few days now, but I encounter the same problem: my result does not coincide with the code I wrote in R to check. The databases are completely the same, so this is not the problem. The code that I have right now is something like this:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import numpy as np
df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
x, y = np.array(df.lstat), np.array(df.crim)
polynomial_features= PolynomialFeatures(degree=3)
xp = polynomial_features.fit_transform(x.reshape(-1,1))
model = sm.OLS(y, xp).fit()
print(model.summary())
I have also made something like this:
import pandas as pd
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf
df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
ft1 = smf.ols(formula=f"crim ~ lstat + I(np.power(lstat,2)) + I(np.power(lstat,3))", data=df).fit()
print(ft1.summary())
These two give completely the same result:
OLS Regression Results
==============================================================================
Dep. Variable: crim R-squared: 0.218
Model: OLS Adj. R-squared: 0.213
Method: Least Squares F-statistic: 46.63
Date: Sat, 03 Oct 2020 Prob (F-statistic): 1.35e-26
Time: 10:26:13 Log-Likelihood: -1744.2
No. Observations: 506 AIC: 3496.
Df Residuals: 502 BIC: 3513.
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 1.2010 2.029 0.592 0.554 -2.785 5.187
lstat -0.4491 0.465 -0.966 0.335 -1.362 0.464
I(np.power(lstat, 2)) 0.0558 0.030 1.852 0.065 -0.003 0.115
I(np.power(lstat, 3)) -0.0009 0.001 -1.517 0.130 -0.002 0.000
==============================================================================
Omnibus: 607.734 Durbin-Watson: 1.239
Prob(Omnibus): 0.000 Jarque-Bera (JB): 53621.219
Skew: 5.726 Prob(JB): 0.00
Kurtosis: 52.114 Cond. No. 5.20e+04
==============================================================================
And here is the program on R:
fit.lstat2 <- lm(crim ~ poly(lstat, 3))
summary(fit.lstat2)
And it gives the following result:
## Call:
## lm(formula = crim ~ poly(lstat, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.234 -2.151 -0.486 0.066 83.353
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6135 0.3392 10.654 <2e-16 ***
## poly(lstat, 3)1 88.0697 7.6294 11.543 <2e-16 ***
## poly(lstat, 3)2 15.8882 7.6294 2.082 0.0378 *
## poly(lstat, 3)3 -11.5740 7.6294 -1.517 0.1299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
## F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
This result is the correct one, but I don't know why python gives the wrong answer. Perhaps I should use some different approaches?
P.S. I have made sure that both codes work, including the database link. So feel free to run them to see the results for yourself.
Upvotes: 1
Views: 1092
Reputation: 17007
I am not expert in R but, i suppose you dont use orthogonal polynomial, so you have to set raw=TRUE
I have the same result than python statsmodels.api when i use raw=TRUE for R process:
fit.lstat2 <- lm(crim ~ poly(lstat, 3, raw=TRUE))
summary(fit.lstat2)
result:
Call:
lm(formula = crim ~ poly(lstat, 3, raw = TRUE))
Residuals:
Min 1Q Median 3Q Max
-15.234 -2.151 -0.486 0.066 83.353
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.2009656 2.0286452 0.592 0.5541
poly(lstat, 3, raw = TRUE)1 -0.4490656 0.4648911 -0.966 0.3345
poly(lstat, 3, raw = TRUE)2 0.0557794 0.0301156 1.852 0.0646 .
poly(lstat, 3, raw = TRUE)3 -0.0008574 0.0005652 -1.517 0.1299
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared: 0.2179, Adjusted R-squared: 0.2133
F-statistic: 46.63 on 3 and 502 DF, p-value: < 2.2e-16
Upvotes: 2