Reputation: 1318
I am trying to build a multinomial logit model using Python and Stata. My data is as follows:
ses_type prog_type read write math prog ses
0 low Diploma 39.2 40.2 46.2 0 0
1 middle general 39.2 38.2 46.2 1 1
2 high Diploma 44.5 44.5 49.5 0 2
3 low Diploma 43.0 43.0 48.0 0 0
4 middle Diploma 44.5 36.5 45.5 0 1
5 high general 47.3 41.3 47.3 1 2
I am trying to predict prog using ses, read, write, and math, where ses represents socioeconomic status and is a nominal variable. Therefore I created my model in Stata using the following command:
mlogit prog i.ses read write math, base(2)
Stata output is as follows:
Iteration 0: log likelihood = -204.09667
Iteration 1: log likelihood = -171.90258
Iteration 2: log likelihood = -170.13513
Iteration 3: log likelihood = -170.11071
Iteration 4: log likelihood = -170.1107
Multinomial logistic regression Number of obs = 200
LR chi2(10) = 67.97
Prob > chi2 = 0.0000
Log likelihood = -170.1107 Pseudo R2 = 0.1665
------------------------------------------------------------------------------
prog | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
0 |
ses |
1 | .6197969 .5059335 1.23 0.221 -.3718146 1.611408
2 | -.5131952 .6280601 -0.82 0.414 -1.74417 .7177799
|
read | -.0405302 .0289314 -1.40 0.161 -.0972346 .0161742
write | -.0459711 .0270153 -1.70 0.089 -.09892 .0069779
math | -.0990497 .0331576 -2.99 0.003 -.1640373 -.0340621
_cons | 9.544131 1.738404 5.49 0.000 6.136921 12.95134
-------------+----------------------------------------------------------------
1 |
ses |
1 | -.3350861 .4607246 -0.73 0.467 -1.23809 .5679176
2 | -.8687013 .5363968 -1.62 0.105 -1.92002 .182617
|
read | -.0226249 .0264534 -0.86 0.392 -.0744726 .0292228
write | -.011618 .0266782 -0.44 0.663 -.0639063 .0406703
math | -.0591301 .0299996 -1.97 0.049 -.1179283 -.000332
_cons | 5.041193 1.524174 3.31 0.001 2.053866 8.028519
-------------+----------------------------------------------------------------
2 | (base outcome)
------------------------------------------------------------------------------
I tried to replicate the same results using the scikit-learn module in Python. Following is the code:
import pandas as pd
import numpy as np
from sklearn import linear_model

data = pd.read_csv("C://Users/Furqan/Desktop/random_data.csv")
train_x = np.array(data[['read', 'write', 'math', 'ses ']])
train_y = np.array(data['prog'])
mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
                                         solver='newton-cg').fit(train_x, train_y)
print(mul_lr.intercept_)
print(mul_lr.coef_)
The output values (intercept and coefficients) are as follows:
[ 4.76438772 0.19347405 -4.95786177]
[[-0.01735513 -0.02731273 -0.04463257 0.01721334]
[-0.00319366 0.00783135 -0.00689664 -0.24480926]
[ 0.02054879 0.01948137 0.05152921 0.22759592]]
The values turn out to be different.
My first question is: why do the results differ?
My second question is: in the case of a nominal predictor variable, how can we instruct Python that ses is an indicator variable?
EDIT:
Link to data file
Upvotes: 7
Views: 3993
Reputation: 11424
There are several issues that make Stata and sklearn results differ: the treatment of the categorical ses predictor, the choice of base outcome, and sklearn's default regularization. We need to change all three to achieve similar outputs.
The formula that Stata uses for the linear part is

prediction = a0 + a1 * [ses==1] + a2 * [ses==2] + a3 * read + a4 * write + a5 * math

Sklearn, in turn, knows nothing about the categorical nature of ses, and tries to use

prediction = a0 + a1 * ses + a3 * read + a4 * write + a5 * math
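For illustration, here is a minimal sketch (using a made-up ses vector, not the actual data file) of what the bracketed indicator terms mean:

import numpy as np

ses = np.array([0, 1, 2, 0, 1, 2])      # hypothetical ses codes, for illustration only
ses_1 = (ses == 1).astype(int)          # the [ses==1] indicator from the Stata formula
ses_2 = (ses == 2).astype(int)          # the [ses==2] indicator
print(np.column_stack([ses, ses_1, ses_2]))

Stata's i.ses effectively replaces the single ses column with the two indicator columns, while sklearn, left alone, uses the raw 0/1/2 codes as one numeric feature.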
To enable categorical predictions, you need to preprocess the data. This is the only possible way to include categorical variables in an sklearn logistic regression. I find pd.get_dummies() the most convenient way to do this. The following code creates the dummy variables for ses and then drops the "low" level, which evidently corresponds to ses=0 in your example:
import pandas as pd, numpy as np
from sklearn import linear_model
data = pd.read_csv("d1.csv", sep='\t')
data.columns = data.columns.str.strip()
raw_x = data.drop('prog', axis=1)
# making the dummies
train_x = pd.get_dummies(raw_x, columns=['ses']).drop('ses_low ', axis=1)
print(train_x.columns)
train_y = data['prog']
mul_lr = linear_model.LogisticRegression(multi_class='multinomial',
solver='newton-cg').fit(train_x, train_y)
reorder = [4, 3, 0, 1, 2] # the order in which coefficients show up in Stata
print(mul_lr.intercept_)
print(mul_lr.coef_[:, reorder])
It outputs
['read', 'write', 'math', 'ses_high ', 'ses_middle ']
[ 4.67331919 0.19082335 -4.86414254]
[[ 0.47140512 -0.08236331 -0.01909793 -0.02680609 -0.04587383]
[-0.36381476 -0.33294749 -0.0021255 0.00765828 -0.00703075]
[-0.10759035 0.4153108 0.02122343 0.01914781 0.05290458]]
You see that Python has successfully encoded ses into 'ses_high ' and 'ses_middle ', but failed to produce the expected coefficients. By the way, I have changed the order of the coef_ columns in the output to make it look like in Stata.
That's because Stata views the third category of the outcome (prog=='honors') as the base outcome and subtracts its parameters from those of all the other outcomes. In Python, you can reproduce this by running
print(mul_lr.intercept_ - mul_lr.intercept_[-1])
print((mul_lr.coef_ - mul_lr.coef_[-1])[:, reorder])
which gives you
[9.53746174 5.0549659 0. ]
[[ 0.57899547 -0.4976741 -0.04032136 -0.0459539 -0.09877841]
[-0.25622441 -0.74825829 -0.02334893 -0.01148954 -0.05993533]
[ 0. 0. 0. 0. 0. ]]
You can see that the parameters are now close to what Stata gives.
Can you see the pattern? In sklearn, the slope coefficients are smaller (closer to zero) than in Stata. This is not an accident!
This happens because sklearn intentionally shrinks the slope coefficients towards 0 by adding a quadratic penalty on the coefficients to the likelihood function it maximizes. This makes the estimates biased but more stable, even in the case of harsh multicollinearity. In Bayesian terms, such regularization corresponds to a zero-mean Gaussian prior on all the coefficients. You can learn more about regularization in the wiki.
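Conceptually (this is only a rough sketch with a hypothetical helper name, not sklearn's actual implementation, and it ignores how the intercepts are handled), the penalized objective looks like this:

import numpy as np

def penalized_loss(coef, X, y_onehot, C=1.0):
    # negative multinomial log-likelihood plus a quadratic (L2) penalty on the coefficients
    logits = X @ coef.T                                  # shape (n_samples, n_classes)
    logits -= logits.max(axis=1, keepdims=True)          # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -(y_onehot * log_probs).sum()
    penalty = (coef ** 2).sum() / (2 * C)                # stronger shrinkage for smaller C
    return nll + penalty

With the penalty term added, the minimizer is pulled towards zero relative to the unpenalized maximum-likelihood fit.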
In sklearn, this quadratic penalty is controlled by the positive C parameter: the smaller it is, the more regularization you get. You can think of it as the prior variance of each slope coefficient. The default value is C=1, but you can make it larger, like C=1000000, which means almost no regularization. In this case, the output is almost identical to that of Stata:
mul_lr2 = linear_model.LogisticRegression(
multi_class='multinomial', solver='newton-cg', C=1000000
).fit(train_x, train_y)
print(mul_lr2.intercept_ - mul_lr2.intercept_[-1])
print((mul_lr2.coef_ - mul_lr2.coef_[-1])[:, reorder])
which gives you
[9.54412644 5.04126452 0. ]
[[ 0.61978951 -0.51320481 -0.04053013 -0.0459711 -0.09904948]
[-0.33508605 -0.86869799 -0.02262518 -0.01161839 -0.05913068]
[ 0. 0. 0. 0. 0. ]]
The results still differ a little (around the 5th decimal place), but with even less regularization the difference will shrink further.
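Alternatively (this depends on your scikit-learn version and is not part of the recipe above), recent releases let you switch the penalty off entirely instead of using a huge C: scikit-learn 0.22+ accepts penalty='none', and newer versions spell it penalty=None.

mul_lr3 = linear_model.LogisticRegression(
    multi_class='multinomial', solver='newton-cg', penalty='none'
).fit(train_x, train_y)
print(mul_lr3.intercept_ - mul_lr3.intercept_[-1])
print((mul_lr3.coef_ - mul_lr3.coef_[-1])[:, reorder])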
Upvotes: 7