Reputation: 781
I have a training set with one feature (credit balance), with values ranging from 0 to 20,000. The response is either 0 (Default=No) or 1 (Default=Yes). This is a simulated training set generated using a logistic function. For reference, it is available here.
The following boxplot shows the distribution of the balance for the Default=Yes and Default=No classes:
The following is the distribution of the data:
The dataset is also perfectly balanced, with 50% of the data in each response class, so it should be a classic case for logistic regression. However, when I apply logistic regression the score comes out to 0.5 because only y=1 is ever predicted. This is how I am applying it:
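import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# df holds the dataset: a 'Balance' feature column and a 'Default' label column
clf = LogisticRegression().fit(df[['Balance']], df['Default'])
# Mean accuracy on the training data
clf.score(df[['Balance']], df['Default'])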
This suggests that something is off with the way logistic regression fits this data. When the balance feature is scaled, though, the score improves to 87.5%. So does scaling play a role here?
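For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes synthetic data, since the original dataset is not reproduced here: the helper `make_balance_data` and its generating parameters (midpoint 10,000, slope 1/2,000) are hypothetical stand-ins, and `StandardScaler` is just one possible way to scale the feature. The scores you get on the raw feature may differ depending on your scikit-learn version and default solver.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_balance_data(n=2000, seed=0):
    """Hypothetical stand-in for the dataset: balances in [0, 20000] with
    labels drawn from a logistic function of the balance."""
    rng = np.random.default_rng(seed)
    balance = rng.uniform(0, 20_000, size=n)
    # Assumed generating curve: P(default) = sigmoid((balance - 10000) / 2000)
    p_default = 1.0 / (1.0 + np.exp(-(balance - 10_000) / 2_000))
    default = (rng.uniform(size=n) < p_default).astype(int)
    return balance.reshape(-1, 1), default


X, y = make_balance_data()

# Fit on the raw balance values.
unscaled_score = LogisticRegression().fit(X, y).score(X, y)

# Fit on standardized balance values (zero mean, unit variance).
scaled_clf = make_pipeline(StandardScaler(), LogisticRegression())
scaled_score = scaled_clf.fit(X, y).score(X, y)

print(f"unscaled: {unscaled_score:.3f}, scaled: {scaled_score:.3f}")
```

Putting the scaler in a pipeline keeps the scaling and the classifier together, so the same transformation is applied at both fit and score time.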
Edit: Why does scaling play a role here? The scikit-learn documentation for LogisticRegression says that the lbfgs solver is robust to unscaled data.
Upvotes: 2
Views: 941
Reputation: 313
Not only that: if you rescale the feature by any factor, e.g. df['Balance']/2, df['Balance']/1000, or df['Balance']*2, each would probably give ~87% accuracy. Depending on the random state selected by default, you would get either 87% or 50%.
The underlying implementation can use a random number generator when fitting the model (depending on the solver), so it is not uncommon to get different solutions. In this case the classes are not linearly separable, so it might not find a solution, and it definitely won't always give you a good one.
You can find the solution by changing the random_state parameter, so it is probably a good idea to score the model several times and average the performance.
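A rough sketch of scoring over several random states and averaging. This is illustrative only: the data is synthetic (the generating parameters are assumptions, not the asker's dataset), the feature is standardized inside a pipeline, and the seed is passed both to the train/test split and to `LogisticRegression`, which only uses `random_state` with some solvers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed parameters): labels drawn from a
# logistic function of a balance in [0, 20000].
rng = np.random.default_rng(0)
balance = rng.uniform(0, 20_000, size=2000)
p_default = 1.0 / (1.0 + np.exp(-(balance - 10_000) / 2_000))
y = (rng.uniform(size=balance.size) < p_default).astype(int)
X = balance.reshape(-1, 1)

scores = []
for seed in range(5):
    # A different split (and solver seed) on every repetition.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    clf = make_pipeline(StandardScaler(), LogisticRegression(random_state=seed))
    scores.append(clf.fit(X_train, y_train).score(X_test, y_test))

mean_score = float(np.mean(scores))
std_score = float(np.std(scores))
print(f"accuracy over {len(scores)} runs: {mean_score:.3f} +/- {std_score:.3f}")
```

Reporting the spread alongside the mean shows how much the score depends on the particular split or seed.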
[EDIT] Also, https://scikit-learn.org/stable/modules/linear_model.html#liblinear-differences discusses the solvers' robustness to unscaled data and their speed on large datasets.
Upvotes: 2