Do features need to be scaled in Logistic Regression?

Question

I have a training set with one feature, credit balance, whose values range from 0 to 20,000. The response is either 0 (Default=No) or 1 (Default=Yes). This is a simulated training set generated using the logistic function. For reference, it is available here.
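
For readers without the file, data of this shape can be simulated roughly as follows. This is only a sketch: the `simulate_default_data` helper, the midpoint of 10,000, the scale of 2,000, and the sample size are hypothetical choices, not the parameters behind the actual data set.

```python
import numpy as np
import pandas as pd


def simulate_default_data(n=2000, midpoint=10_000.0, scale=2_000.0, seed=0):
    """Draw balances uniformly on [0, 20000] and sample Default from a logistic model.

    midpoint and scale are hypothetical values, not the ones behind the real file.
    """
    rng = np.random.default_rng(seed)
    balance = rng.uniform(0, 20_000, size=n)
    # Logistic function: P(Default = 1 | balance)
    p_default = 1.0 / (1.0 + np.exp(-(balance - midpoint) / scale))
    # Sample the 0/1 response from a Bernoulli distribution with that probability
    default = rng.binomial(1, p_default)
    return pd.DataFrame({"Balance": balance, "Default": default})


df = simulate_default_data()
# Share of each class; close to an even split because the midpoint sits mid-range
print(df["Default"].value_counts(normalize=True))
```

Centering the logistic curve on the middle of the balance range is what keeps the two classes roughly balanced in this sketch.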

The following boxplot shows the distribution of the balance for the Default=Yes and Default=No classes, respectively:

The following is the distribution of the data:

The dataset is also perfectly balanced, with 50% of the data in each response class, so it looks like a classic case for logistic regression. However, when I apply logistic regression, the score comes out as 0.5 because only y=1 is ever predicted. This is how I am applying it:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

# df is the DataFrame holding the simulated data, with columns 'Balance' and 'Default'
X, y = df[['Balance']], df['Default']
clf = LogisticRegression().fit(X, y)
# Mean accuracy on the training data
clf.score(X, y)

This suggests that something is off with the way logistic regression fits this data. When the balance feature is scaled, however, the score improves to 87.5%. So does scaling play a role here?

Edit: Why does scaling play a role here? The sklearn documentation for LogisticRegression says that the lbfgs solver is robust to unscaled data.
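
A minimal sketch of the comparison, for anyone who wants to reproduce it. It assumes standardization with `StandardScaler` inside a pipeline, since the question does not say which scaler was used, and it uses synthetic stand-in data with hypothetical parameters rather than the real file. It also prints the solver's iteration count and the fitted coefficients, which may help show whether the unscaled fit actually converged.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (hypothetical parameters), shaped like the data set in the question
rng = np.random.default_rng(0)
balance = rng.uniform(0, 20_000, size=2000)
p_default = 1.0 / (1.0 + np.exp(-(balance - 10_000) / 2_000))
df = pd.DataFrame({"Balance": balance, "Default": rng.binomial(1, p_default)})
X, y = df[["Balance"]], df["Default"]

# Unscaled fit, as in the question
raw = LogisticRegression().fit(X, y)

# Scaled fit: standardize Balance to zero mean and unit variance before fitting
scaled = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
scaled_lr = scaled.named_steps["logisticregression"]

for name, model, lr in [("raw", raw, raw), ("scaled", scaled, scaled_lr)]:
    print(
        name,
        "score:", round(model.score(X, y), 3),
        "iterations:", lr.n_iter_[0],
        "max_iter:", lr.max_iter,
        "coef:", lr.coef_.ravel(),
        "intercept:", lr.intercept_,
    )
```

If the raw fit reports as many iterations as `max_iter`, or sklearn emits a `ConvergenceWarning`, the solver stopped before converging, which would be one possible explanation for the degenerate predictions. Note also that `LogisticRegression` applies an L2 penalty by default (`C=1.0`), and the effect of that penalty depends on the scale of the feature.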
