rgk

Reputation: 866

Logistic Regression with just ONE numeric feature

What is the right way to use scikit-learn's LogisticRegression solver when you have just one numeric feature?

I ran a simple example that I found hard to explain. Can anyone please explain what I am doing wrong here?

import numpy as np
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()

lr.fit(X, Y)
# predict expects a 2-D array of shape (n_samples, n_features)
print("2 --> {0}".format(lr.predict([[2]])))
print("4 --> {0}".format(lr.predict([[4]])))

This is the output I get when the script finishes running. Shouldn't the prediction for 4 be 0? Assuming each class follows a Gaussian distribution, 4 is nearer to the distribution that the training set labels as 0.

2 --> [0]
4 --> [1]

What is the approach Logistic Regression takes when you have just one column with numeric data?

Upvotes: 3

Views: 7068

Answers (2)

I changed some things in your code and the expected results appeared:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
results = logistic_regression.predict(np.array([2,4,6.4,6.5]).reshape(-1,1))

print('2--> {}'.format(results[0]))
print('4--> {}'.format(results[1]))
print('6.4 --> {}'.format(results[2]))
print('6.5 --> {}'.format(results[3]))

The results are:

2--> 0
4--> 0
6.4 --> 0
6.5 --> 1

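I think you got the wrong results because of the reshaped Y array: you don't need to reshape it. scikit-learn expects the target as a 1-D array of shape (n_samples,); only X has to be 2-D.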
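To make the shape point concrete, here is a small sketch (the names `y_flat` and `X_new` are just illustrative) that flattens the column-vector target from the question and builds a 2-D query array for `predict`:

```python
import numpy as np

# Target as a column vector, exactly as in the question
Y = np.reshape([0, 0, 0, 1, 1, 1], (6, 1))

# Flatten it to the 1-D shape (n_samples,) that estimators expect for y
y_flat = Y.ravel()

# Query points for predict must be 2-D: (n_samples, n_features)
X_new = np.array([2, 4]).reshape(-1, 1)

print(Y.shape, y_flat.shape, X_new.shape)  # (6, 1) (6,) (2, 1)
```

Passing `y_flat` to `fit` avoids the warning scikit-learn raises about a column-vector y.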

Upvotes: 0

Simon

Reputation: 10150

You're handling a single feature correctly, but you're incorrectly assuming that because 4 is close to the class 0 training points, it will also be predicted as class 0.

You can plot your training data along with the sigmoid function, assuming a threshold of y=0.5 for classification, and using the learned coefficients and intercepts from your regression model:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()
lr.fit(X, Y)

plt.figure(1, figsize=(4, 3))
plt.scatter(X.ravel(), Y, color='black', zorder=20)

def model(x):
    return 1 / (1 + np.exp(-x))

X_test = np.linspace(-5, 15, 300)
loss = model(X_test * lr.coef_ + lr.intercept_).ravel()

plt.plot(X_test, loss, color='red', linewidth=3)
plt.axhline(y=0, color='k', linestyle='-')
plt.axhline(y=1, color='k', linestyle='-')
plt.axhline(y=0.5, color='b', linestyle='--')
plt.axvline(x=X_test[123], color='b', linestyle='--')

plt.ylabel('y')
plt.xlabel('X')
plt.xlim(0, 13)
plt.show()

Here is what the sigmoid function looks like in your case:

[Image: the fitted sigmoid curve plotted over the training points]

Zoomed in a bit:

[Image: the same plot, zoomed in around the 0.5 threshold]

For your particular model, the value of X at which Y crosses the 0.5 classification threshold is somewhere between 3.161 and 3.227. You can check this by comparing the loss and X_test arrays: X_test[123] is the X value at the upper bound. If you want an exact value, you can use a function optimization method.

So 4 is predicted as class 1 because it lies above the X value at which Y == 0.5.
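With a single feature the crossing point can also be computed directly rather than read off a grid, because the sigmoid equals 0.5 exactly where coef * x + intercept == 0. A minimal sketch follows, using a 1-D target for brevity; the names `boundary`, `eps`, `below` and `above` are illustrative. The number it prints depends on your scikit-learn version and its default solver and regularization, so it may not fall in the range quoted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression()
lr.fit(X, y)

# Solve coef * x + intercept == 0 for x: this is where P(class 1) == 0.5
boundary = -lr.intercept_[0] / lr.coef_[0, 0]
print("decision boundary at x = {0}".format(boundary))

# Points just either side of the boundary get different labels
eps = 1e-3
below = lr.predict([[boundary - eps]])[0]
above = lr.predict([[boundary + eps]])[0]
print("just below --> {0}, just above --> {1}".format(below, above))
```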

You can further show this with the following:

# predict expects a 2-D array of shape (n_samples, n_features)
print("2 --> {0}".format(lr.predict([[2]])))
print("3 --> {0}".format(lr.predict([[3]])))
print("3.1 --> {0}".format(lr.predict([[3.1]])))
print("3.3 --> {0}".format(lr.predict([[3.3]])))
print("4 --> {0}".format(lr.predict([[4]])))

Which will print out the following:

2 --> [0]
3 --> [0]
3.1 --> [0]  # Below threshold
3.3 --> [1]  # Above threshold
4 --> [1]
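If you want to see how close each point is to the threshold rather than only the hard label, `predict_proba` returns the class probabilities. This is a sketch on the same data with a 1-D target; the variable names are illustrative, and the exact probabilities depend on your scikit-learn version and defaults, so no output is shown.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

lr = LogisticRegression()
lr.fit(X, y)

queries = np.array([2, 3, 3.1, 3.3, 4]).reshape(-1, 1)

# One row per query; columns follow lr.classes_, here [0, 1]
proba = lr.predict_proba(queries)
labels = lr.predict(queries)

for x, p, label in zip(queries.ravel(), proba[:, 1], labels):
    print("{0} --> P(class 1) = {1:.3f}, predicted {2}".format(x, p, label))
```

A point is labelled 1 exactly when its class 1 probability exceeds 0.5, which is the threshold drawn in the plot.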

Upvotes: 6
