Kubaaaa

Reputation: 21

Different probabilities output in logistic regression - sklearn and Stata

This is the first time I have tried to use logistic regression in Python, and I have run into a serious problem.

The result I am interested in is not the zero-one output, i.e. the "decision" made by the logit model. Rather, I would like the predicted probabilities of selecting an option. I also have the probability output from Stata for comparison.

I have three lists with data: scLL, scLL2, and scLLchoice. I read that Stata ignores NaN data, and since scLL and scLL2 may contain NaN, I tried two different approaches: first I simply deleted a row from scLL, scLL2, and scLLchoice if it contained NaN; then I used an imputer instead (as you can see in the code), but the effect was the same. Here is the code:

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X = list(zip(scLL, scLL2))
y = scLLchoice

imputer = SimpleImputer(missing_values=float("nan"), strategy='mean')
X_imputed = imputer.fit_transform(X)

logistic_regression = LogisticRegression()
logistic_regression.fit(X_imputed, y)
y_pred = logistic_regression.predict(X_imputed)

probStim2a = list(logistic_regression.predict_proba(X_imputed)[:, 1])

for p in probStim2a:
    print(p)
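For reference, the row-deletion approach mentioned above can be sketched like this (the inputs here are toy stand-ins, not the question's actual scLL/scLL2/scLLchoice data); dropping any row with a NaN in either predictor is the listwise deletion that Stata performs by default:

```python
import numpy as np

# Toy stand-ins for scLL, scLL2, scLLchoice
scLL = [0.1, np.nan, 0.3, 0.4]
scLL2 = [1.0, 2.0, np.nan, 4.0]
scLLchoice = [0, 1, 0, 1]

X = np.column_stack([scLL, scLL2])
y = np.array(scLLchoice)

# Keep only rows with no NaN in either predictor (listwise deletion)
mask = ~np.isnan(X).any(axis=1)
X_clean, y_clean = X[mask], y[mask]

print(X_clean.shape)  # (2, 2) -- two complete rows survive
```

Note that mean imputation and row deletion fit the model on different samples, so they will generally not give identical coefficients either.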

And this is part of my output:

0.5146826469187935
0.5891472587984292
0.596839657578841
0.5570721046966637
0.35240422902193136

The problem is that output from Stata is quite different:

0,5313423
0,6109276
0,6185878
0,5741577
0,3578928

I checked that the inputs are the same (they are). I also tried statsmodels for comparison and once again got different outputs (different both from Stata and from sklearn). Is it really possible that these tools generate such different outputs, and that it is not some error in my code? I know this involves estimated probabilities, but still, 2 + 2 shouldn't equal 5 just because I used a different calculator...

What should I do?

Upvotes: 2

Views: 360

Answers (1)

desertnaut

Reputation: 60321

Well, arguably the results are not that different; they do bear a qualitative similarity, which implies some different setting between the two implementations.

The most probable suspect in this case is that sklearn's Logistic Regression (LR) by default uses the penalty='l2' argument (docs); in other words, sklearn's implementation is not "vanilla" LR but actually Ridge LR. I am not familiar with Stata (and you do not post the relevant code), but from their own documentation it would seem that this is not the case here, hence the different results.

To obtain the same results, try removing the L2 penalty in sklearn, i.e.:

logistic_regression = LogisticRegression(penalty='none')  # penalty=None in newer scikit-learn versions

Regarding statsmodels, they have their own idiosyncratic defaults (namely without an intercept); see the Cross Validated thread Logistic Regression: Scikit Learn vs Statsmodels for more details.

Upvotes: 1
