Reputation: 866
What is the right way to use scikit-learn's LogisticRegression solver when you have just one numeric feature?
I ran a simple example whose output I find hard to explain. Can anyone please explain what I am doing wrong here?
import numpy as np
from sklearn.linear_model import LogisticRegression
X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))
lr = LogisticRegression()
lr.fit(X, Y)
print("2 --> {0}".format(lr.predict([[2]])))
print("4 --> {0}".format(lr.predict([[4]])))
This is the output I get when the script finishes running. Shouldn't the prediction for 4 be 0, since (if each class follows a Gaussian distribution) 4 is far nearer to the distribution of values that the training set labels as 0?
2 --> [0]
4 --> [1]
What is the approach Logistic Regression takes when you have just one column with numeric data?
Upvotes: 3
Views: 7068
Reputation: 1
I changed some things in your code and the expected results appeared:
import numpy as np
from sklearn.linear_model import LogisticRegression
X_train = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
results = logistic_regression.predict(np.array([2, 4, 6.4, 6.5]).reshape(-1, 1))
print('2--> {}'.format(results[0]))
print('4--> {}'.format(results[1]))
print('6.4 --> {}'.format(results[2]))
print('6.5 --> {}'.format(results[3]))
The results are:
2--> 0
4--> 0
6.4 --> 0
6.5 --> 1
I think you got unexpected results partly because you don't need to reshape the Y array (scikit-learn expects a one-dimensional target and will warn before flattening a column vector); the prediction for 4 can also depend on your scikit-learn version, since the default solver has changed over time.
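As a quick sanity check (a minimal sketch, assuming a reasonably recent scikit-learn), you can fit with both target shapes and compare the learned parameters:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# Fit once with a 1-D target and once with a (6, 1) column vector;
# the latter raises a DataConversionWarning and is flattened internally.
lr_flat = LogisticRegression().fit(X, y)
lr_col = LogisticRegression().fit(X, y.reshape(-1, 1))

print(lr_flat.coef_, lr_flat.intercept_)
print(lr_col.coef_, lr_col.intercept_)  # identical if the reshape is harmless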
Upvotes: 0
Reputation: 10150
You're handling a single feature correctly, but you're incorrectly assuming that, just because 4 is close to the class-0 feature values, it will also be predicted as class 0.
You can plot your training data along with the sigmoid function, assuming a classification threshold of y = 0.5 and using the learned coefficient and intercept from your fitted model:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = np.reshape([1, 2, 3, 10, 11, 12], (6, 1))
Y = [0, 0, 0, 1, 1, 1]  # a 1-D target; no reshape needed

lr = LogisticRegression()
lr.fit(X, Y)

plt.figure(1, figsize=(4, 3))
plt.scatter(X.ravel(), Y, color='black', zorder=20)

def model(x):
    # the logistic (sigmoid) function
    return 1 / (1 + np.exp(-x))

# evaluate the fitted sigmoid over a grid of X values;
# despite the name, `loss` holds the predicted probability of class 1
X_test = np.linspace(-5, 15, 300)
loss = model(X_test * lr.coef_ + lr.intercept_).ravel()

plt.plot(X_test, loss, color='red', linewidth=3)
plt.axhline(y=0, color='k', linestyle='-')
plt.axhline(y=1, color='k', linestyle='-')
plt.axhline(y=0.5, color='b', linestyle='--')  # the 0.5 classification threshold
plt.axvline(x=X_test[123], color='b', linestyle='--')  # approx. where the sigmoid crosses 0.5
plt.ylabel('y')
plt.xlabel('X')
plt.xlim(0, 13)
plt.show()
Here is what the sigmoid function looks like in your case:
Zoomed in a bit:
For your particular model, the value of X where Y sits at the 0.5 classification threshold is somewhere between 3.161 and 3.227. You can check this by comparing the loss and X_test arrays (X_test[123] is the X value associated with the upper bound; you could use a function optimization method to get an exact value, as in the closed-form sketch below).
So the reason why 4 is predicted as class 1 is that 4 lies above the bound where Y == 0.5.
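For an exact value, note that the sigmoid equals 0.5 precisely where its argument is zero, so the boundary can be computed in closed form from the lr model fitted above (a small sketch):

# sigmoid(w*x + b) == 0.5 exactly when w*x + b == 0, i.e. x == -b / w
boundary = -lr.intercept_[0] / lr.coef_[0][0]
print("decision boundary at x = {:.3f}".format(boundary))  # should land between 3.161 and 3.227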
You can further show this with the following:
print ("2 --> {0}".format(lr.predict(2)))
print ("3 --> {0}".format(lr.predict(3)))
print ("3.1 --> {0}".format(lr.predict(3.1)))
print ("3.3 --> {0}".format(lr.predict(3.3)))
print ("4 --> {0}".format(lr.predict(4)))
Which will print out the following:
2 --> [0]
3 --> [0]
3.1 --> [0] # Below threshold
3.3 --> [1] # Above threshold
4 --> [1]
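To see how close each of these points is to the threshold, predict_proba exposes the underlying probabilities (a short sketch with the same fitted model):

# probability of class 1 for each test point; it crosses 0.5
# between x = 3.1 and x = 3.3, matching the predictions above
for x in [2, 3, 3.1, 3.3, 4]:
    print("P(y=1 | x={}) = {:.3f}".format(x, lr.predict_proba([[x]])[0, 1]))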
Upvotes: 6