Jackie Wong

Reputation: 103

Difference between linear regression in Python (and R) and Stata

I'm porting a Stata model to Python and I'm seeing different results from Python and Stata for a linear regression on the same input data (available at https://drive.google.com/file/d/0B8PLy9yAUHvlcTI1SG5sdzdnaWc/view?usp=sharing).

The Stata code is:

reg growth time*
predict ghat
predict resid, residuals

The result (first 5 rows) is:

. list growth ghat resid

     +----------------------------------+
     |    growth       ghat       resid |
     |----------------------------------|
  1. | 2.3527029   2.252279    .1004239 |
  2. |  2.377728   2.214551     .163177 |
  3. | 2.3547957   2.177441     .177355 |
  4. | 3.0027488   2.140942    .8618064 |
  5. | 3.0249328    2.10505    .9198825 |

In Python, the code is:

import pandas as pd
from sklearn.linear_model import LinearRegression


def linear_regression(df, dep_col, indep_cols):
    """Fit an OLS model of dep_col on the space-separated indep_cols."""
    lf = LinearRegression(normalize=True)
    lf.fit(df[indep_cols.split(' ')], df[dep_col])
    return lf


df = pd.read_stata('/tmp/python.dta')
lr = linear_regression(df, 'growth', 'time time2 time3 time4 time5')

df['ghat'] = lr.predict(df['time time2 time3 time4 time5'.split(' ')])
df['resid'] = df.growth - df.ghat

df.head(5)['growth ghat resid'.split(' ')]

and the result is:

     growth      ghat     resid
0  2.352703  3.026936 -0.674233
1  2.377728  2.928860 -0.551132
2  2.354796  2.833610 -0.478815
3  3.002749  2.741135  0.261614
4  3.024933  2.651381  0.373551

I also tried this in R and got the same result as in Python. I can't figure out the root cause: is the algorithm used in Stata slightly different? I can tell from the source code that sklearn uses ordinary least squares, but I have no idea what Stata uses.
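
As a sanity check, the same OLS problem can also be solved directly with numpy; a minimal sketch, using the same column names as above, would look like this:

import numpy as np

# Design matrix with an explicit intercept column, solved by least squares;
# the coefficients should essentially match sklearn's lr.intercept_ / lr.coef_.
cols = 'time time2 time3 time4 time5'.split(' ')
X = np.column_stack([np.ones(len(df)), df[cols].to_numpy(dtype=float)])
y = df['growth'].to_numpy(dtype=float)

beta, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)  # a rank below 6 would flag numerically collinear column(s)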

Could anyone advise here?

---------- Edit 1 -----------

I tried specifying the data type in Stata as double, but Stata still produces the same result as with float. The Stata code that generates the variables is below:

gen double growth = .
foreach lag in `lags' {
    replace growth = ma_${metric}_per_`group' / l`lag'.ma_${metric}_per_`group' - 1 if nlag == `lag' & in_sample
}

gen double time = day - td(01jan2010) + 1
forvalues i = 2/5 {
    gen double time`i' = time^`i'
}
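
For reference, the time powers can be rebuilt the same way in pandas. This is only a rough sketch and assumes day is a datetime column (the growth line above depends on macros and lags specific to our pipeline, so it is not sketched here):

import pandas as pd

# td(01jan2010) in Stata is 1 Jan 2010; 'day' is assumed to be a datetime column.
df['time'] = (df['day'] - pd.Timestamp('2010-01-01')).dt.days + 1
for i in range(2, 6):
    df['time%d' % i] = df['time'] ** i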

---------- Edit 2 -----------

It's confirmed that Stata does drop the time variable because of collinearity. I hadn't seen the message before because our Stata code runs in quiet mode to suppress unwanted messages, and per my investigation this dropping behaviour cannot be disabled in Stata. So it appears I need to detect collinearity and remove the collinear column(s) in Python as well (a rough sketch follows the output below).

. reg growth time*,
note: time omitted because of collinearity

      Source |       SS       df       MS              Number of obs =     381
-------------+------------------------------           F(  4,   376) =  126.10
       Model |  37.6005042     4  9.40012605           Prob > F      =  0.0000
    Residual |  28.0291465   376  .074545602           R-squared     =  0.5729
-------------+------------------------------           Adj R-squared =  0.5684
       Total |  65.6296507   380  .172709607           Root MSE      =  .27303

------------------------------------------------------------------------------
      growth |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        time |          0  (omitted)
       time2 |  -.0098885   .0009231   -10.71   0.000    -.0117037   -.0080734
       time3 |   .0000108   1.02e-06    10.59   0.000     8.77e-06    .0000128
       time4 |  -4.40e-09   4.20e-10   -10.47   0.000    -5.22e-09   -3.57e-09
       time5 |   6.37e-13   6.15e-14    10.35   0.000     5.16e-13    7.58e-13
       _cons |   3322.727   302.7027    10.98   0.000     2727.525     3917.93
------------------------------------------------------------------------------
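
A rough way to mimic that behaviour in Python is to keep only the columns that add rank to the design matrix. This is only a sketch: the tolerance is numpy's default rather than Stata's rule, and which column gets flagged depends on the order in which columns are checked, so it may not single out time exactly as Stata does.

import numpy as np

def drop_collinear(df, cols, tol=None):
    """Greedily keep columns that increase the rank of [constant | kept columns].

    tol is passed to numpy.linalg.matrix_rank; Stata applies its own
    tolerance rule, so borderline cases may be judged differently.
    """
    const = np.ones((len(df), 1))
    kept = []
    for col in cols:
        X = np.column_stack([const, df[kept + [col]].to_numpy(dtype=float)])
        if np.linalg.matrix_rank(X, tol=tol) == len(kept) + 2:
            kept.append(col)
    return kept

# kept = drop_collinear(df, 'time time2 time3 time4 time5'.split(' '))
# lr = linear_regression(df, 'growth', ' '.join(kept))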

Upvotes: 3

Views: 2269

Answers (2)

Nick Cox

Reputation: 37208

The predictors are the 1st ... 5th powers of time, which varies between 1627 and 2007 (presumably a calendar year, not that it matters). Even with modern software it would have been prudent to shift the origin of time to reduce the numerical strain, e.g. to work with powers of (time - 1800).
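
In Python terms, a minimal sketch of that re-centering (1800 is just the illustrative offset mentioned above) could be:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_stata('/tmp/python.dta')

# Build powers of (time - 1800) instead of raw time to reduce the numerical strain.
tc = df['time'] - 1800
X = np.column_stack([tc**p for p in range(1, 6)])

lr = LinearRegression()
lr.fit(X, df['growth'])
df['ghat'] = lr.predict(X)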

Anyway, redoing the regression shows that Stata drops the first predictor as collinear. What happens in Python and R? These are just different reactions to a numerically tricky challenge.

(Fitting a quintic polynomial rarely has scientific value, but that may not be of concern here. The fitted curve based on powers 2 to 5 doesn't work very well for these data, which appear economic. It makes more sense that the first 5 residuals are all positive, but that isn't true of them all!)

[plot: fitted curve based on powers 2 to 5 over the data]

Upvotes: 2

CT Zhu

Reputation: 54330

It is a wildcard issue. In your Stata code, time* will match time2, time3, ... but not time. If the Python code is changed to lr = linear_regression(df, 'growth', 'time2 time3 time4 time5'), it will crank out the exact same result.

Edit

It appears Stata dropped the 1st independent variable. The fit can be visualized as follows:

import numpy as np
import matplotlib.pyplot as plt

# lr1 keeps 'time'; lr2 drops it, mirroring what Stata fits.
lr1 = linear_regression(df, 'growth', 'time time2 time3 time4 time5')
lr2 = linear_regression(df, 'growth', 'time2 time3 time4 time5')
# Evaluate both fitted polynomials on a grid of time values.
pred_x1 = ((np.linspace(1620, 2000)[..., np.newaxis]**np.array([1, 2, 3, 4, 5]))*lr1.coef_).sum(1) + lr1.intercept_
pred_x2 = ((np.linspace(1620, 2000)[..., np.newaxis]**np.array([2, 3, 4, 5]))*lr2.coef_).sum(1) + lr2.intercept_
plt.plot(np.linspace(1620, 2000), pred_x1, label='Python/R fit')
plt.plot(np.linspace(1620, 2000), pred_x2, label='Stata fit')
plt.plot(df.time, df.growth, '+', label='Data')
plt.legend(loc=0)

[plot: Python/R fit, Stata fit, and the data]

And the residual sum of squares:

pred1 = (df.time.values[..., np.newaxis]**np.array([1, 2, 3, 4, 5])*lr1.coef_).sum(1) + lr1.intercept_
pred2 = (df.time.values[..., np.newaxis]**np.array([2, 3, 4, 5])*lr2.coef_).sum(1) + lr2.intercept_
print('Python fit RSS', ((pred1 - df.growth.values)**2).sum())
print('Stata fit RSS', ((pred2 - df.growth.values)**2).sum())

Python fit RSS 7.2062436549
Stata fit RSS 28.0291464826

Upvotes: 1
