Alexander Tverdohleb

Reputation: 453

Python Logistic Regression Produces Wrong Coefficients

I am trying to use the LogisticRegression model from scikit-learn to solve Exercise 2 from Andrew Ng's Machine Learning course on Coursera, but the result I get is wrong:

1) The outcome coefficients don't match the answers:

What I get with the model:

[image: LogisticRegression outcome]

What I should get according to the answers:

[-25.16, 0.21, 0.20]

You can see on the plot (the wrong graph) that the decision boundary seems to sit a little below where it intuitively should be.

2) The graph outcome seems wrong

As you can see, the decision boundary sits below where it should be:

[image: LogisticRegression plot]

Answer:

[image: answers plot]

MY CODE:

%matplotlib notebook

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# IMPORT DATA

folder = ''  # base path to the course files (set elsewhere in my notebook)
ex2_folder = 'machine-learning-ex2/ex2'
input_1 = pd.read_csv(folder + ex2_folder + '/ex2data1.txt', header=None)
X = input_1[[0, 1]]  # exam 1 and exam 2 scores
y = input_1[2]       # admission outcome (0 or 1)


# IMPORT AND FIT MODEL

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(fit_intercept=True)
model.fit(X, y)
print('Intercept (theta 0): {}. Coefficients: {}'.format(model.intercept_, model.coef_))



# CALCULATE GRID

n = 5  # grid step

# regular grid over the score range, and the model's P(y = 1) at each point
xx1, xx2 = np.mgrid[25:101:n, 25:101:n]
grid = np.c_[xx1.ravel(), xx2.ravel()]
probs = model.predict_proba(grid)[:, 1]
probs = probs.reshape(xx1.shape)


# PLOTTING

f = plt.figure()
ax = plt.gca()


# scatter the two classes: yellow circles (y = 0) and black crosses (y = 1)
for outcome in [0, 1]:
    marker = 'yo' if outcome == 0 else 'k+'
    selection = y == outcome
    plt.plot(X.loc[selection, 0], X.loc[selection, 1], marker, mec='k')
plt.xlim([25,100])
plt.ylim([25,100])

plt.xlabel('Exam 1 Score')
plt.ylabel('Exam 2 Score')
plt.title('Exam 1 & 2 and admission outcome')

# filled contour of P(y = 1) over the grid
contour = ax.contourf(xx1, xx2, probs, 100, cmap="RdBu",
                      vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

# decision boundary: the P(y = 1) = 0.5 level curve
plt.contour(xx1, xx2, probs, [0.5], linewidths=1, colors='b', alpha=0.3)

# mark grid points classified as positive
plt.plot(xx1[probs > 0.5], xx2[probs > 0.5], '.b', alpha=0.3)

LINKS

DataFile in txt

PDF Tasks and Solutions in Octave

Upvotes: 2

Views: 1208

Answers (1)

Simas Joneliunas

Reputation: 3136

To get identical results you need to create identical testing conditions.

One obvious difference at a glance is the iteration count: scikit-learn's LogisticRegression defaults to 100 iterations (max_iter=100), while Andrew Ng's sample code runs for 400. That alone can give you a different result from the course's.

The bigger difference is regularization. Both Ng's exercise and scikit-learn minimize the cross-entropy cost, but scikit-learn's LogisticRegression adds an L2 penalty by default (controlled by C, which defaults to 1.0), while exercise 2 fits an unregularized model. That penalty shrinks the coefficients, which is why yours come out far smaller than [-25.16, 0.21, 0.20].
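As a minimal sketch of how you might line the two setups up (the huge C value is my own way of making the L2 penalty negligible, not something from the course):

from sklearn.linear_model import LogisticRegression

# effectively unregularized: a very large C makes the default L2 penalty negligible;
# max_iter=400 matches the 400 iterations in Ng's sample code
model = LogisticRegression(C=1e10, fit_intercept=True, max_iter=400)
model.fit(X, y)

# should land close to the course answer: intercept ~ -25.16, coefficients ~ [0.21, 0.20]
print(model.intercept_, model.coef_)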

And one last note: before you reach for higher-level solutions (scikit-learn/TensorFlow/Keras), try implementing them in pure Python first to understand how they work. After that it will be easier (and more fun) to make the higher-level packages work for you.
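For instance, here is a minimal NumPy sketch of what the exercise itself asks for, batch gradient descent on the cross-entropy cost (the learning rate and iteration count are illustrative guesses, not the course's values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=1e-3, n_iter=400):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # prepend a column of ones so theta[0] acts as the intercept
    Xb = np.c_[np.ones(len(X)), X]
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        # gradient of the cross-entropy cost: Xb^T (sigmoid(Xb theta) - y) / m
        theta -= lr * Xb.T @ (sigmoid(Xb @ theta) - y) / len(y)
    return theta

theta = fit_logistic(X, y)

Bear in mind that plain gradient descent converges very slowly on the unscaled exam scores; the course itself hands the cost and gradient to fminunc, and scipy.optimize.minimize plays the same role in Python.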

Upvotes: 3
