Lin Ma

Reputation: 10139

How does scikit-learn figure out whether logistic regression is doing classification or regression?

I think logistic regression could be used for both regression (getting a number between 0 and 1, e.g. using logistic regression to predict a probability) and classification. The question is: after we provide the training data and target, how does logistic regression automatically figure out whether we are doing regression or classification?

For example, in the example code below, logistic regression figured out that the output should be one of the three classes 0, 1, 2, rather than any number between 0 and 2. I'm just curious how logistic regression automatically figures out whether it is solving a regression problem (continuous output) or a classification problem (discrete output). (See also the small check after the example code.)

http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html

print(__doc__)


# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

h = .02  # step size in the mesh

logreg = linear_model.LogisticRegression(C=1e5)

# we create an instance of the logistic regression classifier and fit the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()
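A quick way to see both kinds of output (a small check added here for illustration, assuming the example above has already been run so that logreg, X and Y exist): predict returns discrete class labels, while predict_proba returns the continuous per-class probabilities.

# Illustrative check, not part of the original scikit-learn example.
sample = X[:5]

# Discrete output: class labels 0, 1 or 2.
print(logreg.predict(sample))

# Continuous output: one probability per class, each row sums to 1.
print(logreg.predict_proba(sample))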

Upvotes: 0

Views: 1572

Answers (1)

John Yetter

Reputation: 251

Logistic regression typically uses a cross-entropy cost function, which models loss in terms of binary error. Also, the output of logistic regression follows a sigmoid around the decision boundary: while the boundary itself may be linear, the output (usually read as the probability that a point belongs to one of the two classes on either side of the boundary) transitions in a non-linear fashion. This would make your "regression model from 0 to 1" a very particular non-linear function. That might be desirable in certain circumstances, but it is probably not desirable in general.
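As a rough illustration (my own sketch, not scikit-learn's internals), here are the sigmoid and the binary cross-entropy loss written out in plain NumPy; the scores and labels are made up just to show the shape of the functions:

import numpy as np

def sigmoid(z):
    # Squashes any real-valued score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y_true, p):
    # Binary cross-entropy: confident wrong predictions are penalized heavily.
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])  # hypothetical raw linear scores w.x + b
p = sigmoid(z)                             # probabilities close to 0 or 1 away from the boundary
y = np.array([0, 0, 1, 1, 1])              # hypothetical binary labels
print(p)
print(cross_entropy(y, p))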

You can think of logistic regression as providing an amplitude that represents the probability of being in a class, or not. If you consider a binary classifier with two independent variables, you can picture a surface where the decision boundary is the topological line where the probability is 0.5. Where the classifier is certain of the class, the surface is either on the plateau (probability = 1) or in the low-lying region (probability = 0). The transition from low-probability regions to high usually follows a sigmoid function.
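If it helps, here is a small sketch (my own, restricting iris to two classes so the problem is binary) that plots that probability surface with predict_proba; the 0.5 contour is the decision boundary described above:

# Sketch: probability surface of a binary logistic regression on two features.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

iris = datasets.load_iris()
mask = iris.target < 2                       # keep two classes for a binary problem
Xb, yb = iris.data[mask, :2], iris.target[mask]

clf = linear_model.LogisticRegression(C=1e5).fit(Xb, yb)

xx, yy = np.meshgrid(np.linspace(Xb[:, 0].min() - .5, Xb[:, 0].max() + .5, 200),
                     np.linspace(Xb[:, 1].min() - .5, Xb[:, 1].max() + .5, 200))
proba = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1].reshape(xx.shape)

plt.contourf(xx, yy, proba, levels=20, cmap=plt.cm.RdBu)  # the sigmoid "surface"
plt.contour(xx, yy, proba, levels=[0.5], colors='k')      # decision boundary at p = 0.5
plt.scatter(Xb[:, 0], Xb[:, 1], c=yb, edgecolors='k', cmap=plt.cm.Paired)
plt.show()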

You might look at Andrew Ng's Coursera course, which has a set of classes on logistic regression; this is the first of those classes. I have a GitHub repo with the R version of that class's exercises, here, which you might find helpful in understanding logistic regression better.

Upvotes: 1
