Reputation: 21625
I'm having some trouble understanding scikit-learn's LogisticRegression() method. Here's a simple example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Create a sample dataframe
data = [['Age', 'ZepplinFan'], [13, 0], [25, 0], [40, 1], [51, 0], [55, 1], [58, 1]]
columns=data.pop(0)
df = pd.DataFrame(data=data, columns=columns)
Age ZepplinFan
0 13 0
1 25 0
2 40 1
3 51 0
4 55 1
5 58 1
# Fit Logistic Regression
lr = LogisticRegression()
lr.fit(X=df[['Age']], y = df['ZepplinFan'])
# View the coefficients
lr.intercept_ # returns -0.56333276
lr.coef_ # returns 0.02368826
# Predict for new values
xvals = np.arange(-10,70,1)
predictions = lr.predict_proba(X=xvals[:,np.newaxis])
probs = [y for [x, y] in predictions]
# Plot the fitted model
plt.plot(xvals, probs)
plt.scatter(df.Age.values, df.ZepplinFan.values)
plt.show()
This clearly isn't a good fit. Furthermore, when I do this exercise in R I get different coefficients and a model that makes more sense.
lapply(c("data.table","ggplot2"), require, character.only=T)
dt <- data.table(Age=c(13, 25, 40, 51, 55, 58), ZepplinFan=c(0, 0, 1, 0, 1, 1))
mylogit <- glm(ZepplinFan ~ Age, data = dt, family = "binomial")
newdata <- data.table(Age=seq(10,70,1))
newdata[, ZepplinFan:=predict(mylogit, newdata=newdata, type="response")]
mylogit$coeff
(Intercept) Age
-4.8434 0.1148
ggplot()+geom_point(data=dt, aes(x=Age, y=ZepplinFan))+geom_line(data=newdata, aes(x=Age, y=ZepplinFan))
What am I missing here?
Upvotes: 2
Views: 1278
Reputation: 36086
The problem you are facing is that scikit-learn uses regularized logistic regression by default. The regularization term controls the trade-off between fitting the training data and generalizing to future, unseen data. The parameter C is the inverse of the regularization strength, so larger values mean weaker regularization. In your case:
lr = LogisticRegression(C=100)
will generate what you are looking for.
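To see the effect of C on the coefficients, here is a minimal sketch using the data from the question. It assumes solver="liblinear", which also penalizes the intercept and was the default in older scikit-learn releases; newer releases default to a different solver, so the default fit may already sit closer to R's there. The values of C and max_iter are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Data from the question: one feature (Age) and a binary label.
X = np.array([13, 25, 40, 51, 55, 58], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1])

results = {}
for C in (1.0, 100.0, 1e6):
    # Assumption: liblinear is chosen explicitly because it regularizes
    # the intercept as well as the slope.
    lr = LogisticRegression(C=C, solver="liblinear", max_iter=1000)
    lr.fit(X, y)
    results[C] = (lr.intercept_[0], lr.coef_[0, 0])
    print(f"C={C:g}: intercept={results[C][0]:.4f}, slope={results[C][1]:.4f}")
```

As C grows, the penalty becomes negligible and the coefficients should move toward the unregularized estimates that R's glm reports.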
As you have discovered, changing the value of the intercept_scaling parameter achieves a similar effect. The reason is again regularization, or rather how it affects the estimate of the bias (intercept) in the regression: a larger intercept_scaling effectively reduces the impact of regularization on the bias.
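The same comparison can be sketched for intercept_scaling while leaving C at 1. This again assumes solver="liblinear", the solver for which scikit-learn documents this parameter as having an effect; the scaling value of 1000 is an arbitrary illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Data from the question.
X = np.array([13, 25, 40, 51, 55, 58], dtype=float).reshape(-1, 1)
y = np.array([0, 0, 1, 0, 1, 1])

scaled = {}
for scaling in (1.0, 1000.0):
    # Assumption: liblinear appends a constant feature equal to
    # intercept_scaling, so a larger value lets the intercept grow
    # while the penalized weight on that feature stays small.
    lr = LogisticRegression(
        C=1.0, solver="liblinear", intercept_scaling=scaling, max_iter=1000
    )
    lr.fit(X, y)
    scaled[scaling] = (lr.intercept_[0], lr.coef_[0, 0])
    print(f"intercept_scaling={scaling:g}: "
          f"intercept={scaled[scaling][0]:.4f}, slope={scaled[scaling][1]:.4f}")
```

With the larger scaling the intercept is much less constrained, so the fitted curve should look far more like the one R produces even though C is unchanged.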
For more information about the implementation of LR and solvers used by scikit-learn, check: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Upvotes: 4