Arun

Reputation: 2478

Scaling columns pandas DataFrame

In a dataset with multiple numeric columns, the columns usually have different ranges and distributions. As an example, I have used the Iris dataset. The distributions of its 4 columns are shown below:

[Distribution plots of the four columns: petal_length, petal_width, sepal_length, sepal_width]

My question is:

Should columns with similar distributions use the same scaler? In this case, petal length & petal width have similar distributions, and sepal length & sepal width also have (approximately) similar distributions. Therefore, I have used a Min-Max scaler for petal length & petal width, and a Standard scaler for sepal length & sepal width.

The sample code for these operations is:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

# According to the distribution visualizations above, appropriate scalers are used-
std_scaler = StandardScaler()
iris_data[['sepallength', 'sepalwidth']] = std_scaler.fit_transform(iris_data[['sepallength', 'sepalwidth']])

# 'StandardScaler' subtracts the mean from each feature/attribute and then
# scales to unit variance

# Sanity checks-
iris_data['sepallength'].min(), iris_data['sepallength'].max()
# (-1.870024133847019, 2.4920192021244283)

iris_data['sepalwidth'].min(), iris_data['sepalwidth'].max()
# (-2.438987252491841, 3.1146839106774356)


mm_scaler = MinMaxScaler()
iris_data[['petallength', 'petalwidth']] = mm_scaler.fit_transform(iris_data[['petallength', 'petalwidth']])

# Sanity checks-
iris_data['petallength'].min(), iris_data['petallength'].max()
# (0.0, 1.0)

iris_data['petalwidth'].min(), iris_data['petalwidth'].max()
# (0.0, 1.0)

Due to the standard scaler, the ranges for sepal length and sepal width differ, while the ranges for petal length and petal width are the same (0 to 1). Is this a problem, given that different columns end up on different ranges, which might affect the ML model trained on them?

Is there a golden set of rules for scaling/handling different numeric columns/attributes within a given dataset?

Upvotes: 0

Views: 260

Answers (1)

JonnDough

Reputation: 897

It depends a bit on the algorithm you use for your task. For example, tree-based algorithms (RandomForest, XGBoost) tend to be less affected by scale differences (they are often called scale-invariant, though this is not completely true, since performance can sometimes improve if you scale the variables). On the other hand, SVMs and logistic regression require scaling to prevent features with large values and high variances from dominating the model.
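To make this concrete, here is a small sketch (my own illustration, not from your code) comparing cross-validated accuracy on the bundled Iris data for a linear model and a tree ensemble, with and without scaling; the exact numbers will vary, the point is the relative sensitivity:

```python
# Illustrative sketch: scaling tends to matter for a linear model,
# but a tree ensemble is largely insensitive to monotonic rescaling.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Logistic regression, with and without a scaling step
lr_raw = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
lr_scaled = cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)), X, y, cv=5
).mean()

# Random forest, with and without the same scaling step
rf_raw = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()
rf_scaled = cross_val_score(
    make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0)), X, y, cv=5
).mean()

print(f"LogisticRegression  raw={lr_raw:.3f}  scaled={lr_scaled:.3f}")
print(f"RandomForest        raw={rf_raw:.3f}  scaled={rf_scaled:.3f}")
```

Note the use of a `Pipeline`: fitting the scaler inside `cross_val_score` avoids leaking statistics from the validation folds into training.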

In general, I tend to use StandardScaler(), but sometimes model performance is better with MinMaxScaler(). This is a trial-and-error approach, I suppose. I am unaware of any consensus that one form of scaling is better than the other, but I am by no means an expert. Nonetheless, I would advise using one form of scaling for all features (either StandardScaler() or MinMaxScaler(), given that all your features are continuous) for comparability, and to counter your problem of different ranges and thus weights.
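Applying one scaler to all columns at once is simple with a DataFrame. A minimal sketch, assuming a DataFrame like your `iris_data` (here built from sklearn's bundled Iris data, so the column names differ from yours):

```python
# One StandardScaler fit jointly on all numeric columns, so every
# feature ends up on a comparable scale (mean ~0, unit variance).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_data = load_iris(as_frame=True).frame.drop(columns="target")

num_cols = iris_data.columns  # all four Iris features are continuous
scaler = StandardScaler()
iris_data[num_cols] = scaler.fit_transform(iris_data[num_cols])

# Sanity check: each column now has mean ~0 and (population) std ~1
print(iris_data.mean().round(3))
print(iris_data.std(ddof=0).round(3))
```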

I do think this question was maybe more suited for CrossValidated instead of StackOverflow.

Upvotes: 1
