Reputation: 103
I'm porting a Stata model to Python and I'm seeing different results from Stata and Python for a linear regression on the same input data (available at https://drive.google.com/file/d/0B8PLy9yAUHvlcTI1SG5sdzdnaWc/view?usp=sharing).
The Stata code is as below:
reg growth time*
predict ghat
predict resid, residuals
And the result is (first 5 rows):
. list growth ghat resid
+----------------------------------+
| growth ghat resid |
|----------------------------------|
1. | 2.3527029 2.252279 .1004239 |
2. | 2.377728 2.214551 .163177 |
3. | 2.3547957 2.177441 .177355 |
4. | 3.0027488 2.140942 .8618064 |
5. | 3.0249328 2.10505 .9198825 |
In Python, the code is:
import pandas as pd
from sklearn.linear_model import LinearRegression

def linear_regression(df, dep_col, indep_cols):
    lf = LinearRegression(normalize=True)
    lf.fit(df[indep_cols.split(' ')], df[dep_col])
    return lf

df = pd.read_stata('/tmp/python.dta')
lr = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
df['ghat'] = lr.predict(df['time time2 time3 time4 time5'.split(' ')])
df['resid'] = df.growth - df.ghat
df.head(5)['growth ghat resid'.split(' ')]
and the result is:
growth ghat resid
0 2.352703 3.026936 -0.674233
1 2.377728 2.928860 -0.551132
2 2.354796 2.833610 -0.478815
3 3.002749 2.741135 0.261614
4 3.024933 2.651381 0.373551
I also tried R and got the same result as in Python. I could not figure out the root cause: is the algorithm used in Stata slightly different? I can tell from the source code that sklearn uses ordinary least squares, but I have no idea what Stata uses.
Could anyone advise here?
---------- Edit 1 -----------
I tried specifying the data type in Stata as double, but Stata still produces the same result as with float. The Stata code that generates the variables is as below:
gen double growth = .
foreach lag in `lags' {
    replace growth = ma_${metric}_per_`group' / l`lag'.ma_${metric}_per_`group' - 1 if nlag == `lag' & in_sample
}
gen double time = day - td(01jan2010) + 1
forvalues i = 2/5 {
    gen double time`i' = time^`i'
}
---------- Edit 2 -----------
It's confirmed that Stata does drop the time variable due to collinearity. The message was not seen before because our Stata code uses the quietly prefix to suppress unwanted output. Per my investigation, this automatic omission cannot be disabled in Stata, so it appears that I need to detect collinearity and remove the collinear column(s) in Python as well (see the sketch after the output below).
. reg growth time*,
note: time omitted because of collinearity
Source | SS df MS Number of obs = 381
-------------+------------------------------ F( 4, 376) = 126.10
Model | 37.6005042 4 9.40012605 Prob > F = 0.0000
Residual | 28.0291465 376 .074545602 R-squared = 0.5729
-------------+------------------------------ Adj R-squared = 0.5684
Total | 65.6296507 380 .172709607 Root MSE = .27303
------------------------------------------------------------------------------
growth | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
time | 0 (omitted)
time2 | -.0098885 .0009231 -10.71 0.000 -.0117037 -.0080734
time3 | .0000108 1.02e-06 10.59 0.000 8.77e-06 .0000128
time4 | -4.40e-09 4.20e-10 -10.47 0.000 -5.22e-09 -3.57e-09
time5 | 6.37e-13 6.15e-14 10.35 0.000 5.16e-13 7.58e-13
_cons | 3322.727 302.7027 10.98 0.000 2727.525 3917.93
------------------------------------------------------------------------------
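For reference, here is a minimal sketch of one way to detect and drop collinear columns in Python before fitting, using a QR-based rank check with numpy; the helper name and the tolerance are assumptions, and Stata's own dropping rules may differ in detail:

import numpy as np

def drop_collinear(X, tol=1e-10):
    # Keep a column only if it adds a new direction beyond the intercept
    # and the columns already kept; otherwise treat it as (numerically)
    # collinear and omit it, roughly as Stata does. The tolerance is a
    # tuning assumption -- Stata applies its own internal rules.
    n = X.shape[0]
    kept = []
    basis = np.ones((n, 1))                        # intercept column
    for j in range(X.shape[1]):
        col = X[:, [j]] / np.linalg.norm(X[:, j])  # scale for a relative test
        r = np.linalg.qr(np.hstack([basis, col]), mode='r')
        if abs(r[-1, -1]) > tol:                   # genuinely new direction
            basis = np.hstack([basis, col])
            kept.append(j)
    return kept

# Hypothetical usage with the variables above (tol may need loosening,
# e.g. to 1e-6, to flag near-collinearity at Stata's float precision):
# cols = 'time time2 time3 time4 time5'.split()
# keep = drop_collinear(df[cols].values.astype(float))
# lr = linear_regression(df, 'growth', ' '.join(cols[i] for i in keep))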
Upvotes: 3
Views: 2269
Reputation: 37208
The predictors are the 1st to 5th powers of time, which varies between 1627 and 2007 (presumably a calendar year, not that it matters). Even with modern software it would be prudent to shift the origin of time to reduce the numerical strain, e.g. to work with powers of (time - 1800), as the check below illustrates.
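To see the size of the effect, here is a quick numerical check in the question's Python setting; the evenly spaced time values are an assumption, chosen only to match the quoted range and observation count:

import numpy as np

# Hypothetical reconstruction: 381 consecutive time values covering
# the 1627..2007 range quoted above.
time = np.arange(1627, 2008, dtype=float)

raw = np.column_stack([time**k for k in range(1, 6)])
centered = np.column_stack([(time - 1800)**k for k in range(1, 6)])

# The condition number measures the numerical strain on the solver:
# raw powers of a variable on a ~2000 scale are catastrophically
# ill-conditioned, while centering first helps enormously.
print('raw cond:', np.linalg.cond(raw))
print('centered cond:', np.linalg.cond(centered))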
Anyway, redoing the regression shows that Stata drops the first predictor as collinear. What happens in Python and R? These are different reactions to a numerically tricky challenge.
(Fitting a quintic polynomial rarely has scientific value, but that may not be of concern here. The fitted curve based on powers 2 to 5 doesn't work very well for these data, which appear economic: it is no surprise that the first 5 residuals are all positive, although that isn't true of all of them!)
Upvotes: 2
Reputation: 54330
The practical difference is which predictors are actually used. In your Stata output, time ends up excluded from the model (dropped as collinear, per Edit 2 above), so the fit uses only time2, time3, time4 and time5. If the Python code is changed to lr = linear_regression(df, 'growth', 'time2 time3 time4 time5'), it will crank out the exact same result. So Stata effectively dropped the 1st independent variable. The fit can be visualized as follows:
import numpy as np
import matplotlib.pyplot as plt

lr1 = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
lr2 = linear_regression(df, 'growth', 'time2 time3 time4 time5')

# Evaluate each fitted polynomial on a grid of time values.
grid = np.linspace(1620, 2000)
pred_x1 = ((grid[..., np.newaxis]**np.array([1, 2, 3, 4, 5]))*lr1.coef_).sum(1) + lr1.intercept_
pred_x2 = ((grid[..., np.newaxis]**np.array([2, 3, 4, 5]))*lr2.coef_).sum(1) + lr2.intercept_

plt.plot(grid, pred_x1, label='Python/R fit')
plt.plot(grid, pred_x2, label='Stata fit')
plt.plot(df.time, df.growth, '+', label='Data')
plt.legend(loc=0)
And the residual sum of squares:
# Residual sum of squares for each fit, evaluated on the data itself.
pred1 = (df.time.values[..., np.newaxis]**np.array([1, 2, 3, 4, 5])*lr1.coef_).sum(1) + lr1.intercept_
pred2 = (df.time.values[..., np.newaxis]**np.array([2, 3, 4, 5])*lr2.coef_).sum(1) + lr2.intercept_
print('Python fit RSS', ((pred1 - df.growth.values)**2).sum())
print('Stata fit RSS', ((pred2 - df.growth.values)**2).sum())
Python fit RSS 7.2062436549
Stata fit RSS 28.0291464826
Note that the Stata fit RSS matches the Residual SS (28.0291465) in the Stata regression output above, confirming that the two fits are the same model.
Upvotes: 1