user9238790

Normalization / Scaling as preprocessing step in python

I am not sure of the exact name of the method, but I will describe it and hopefully someone can label it and amend the question accordingly. Here is the code to create the dataset.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300,
                           n_features=6,
                           n_informative=4,
                           n_classes=2,
                           random_state=0,
                           shuffle=True,
                           shift=5,
                           scale=10)

# Creating a DataFrame
df = pd.DataFrame({'Feature 1': X[:, 0],
                   'Feature 2': X[:, 1],
                   'Feature 3': X[:, 2],
                   'Feature 4': X[:, 3],
                   'Feature 5': X[:, 4],
                   'Feature 6': X[:, 5],
                   'Class': y})

df.describe()

(image: output of df.describe() showing the summary statistics of the six features)

Let's look at the output of feature 2 and feature 4 as an example to explain my point.

Assuming we only have positive values, how do I rescale Feature 2 and Feature 4 to the range 0 to 1, in accordance with the range of the values in their columns?

Let me illustrate further. After rescaling, the minimum of Feature 2 and of Feature 4 would become 0 and the maximum would become 1. From the output above, we can see that Feature 2's maximum is around 73 and Feature 4's is 91. The idea is that a change on Feature 2 from 73 to 71 should map to a bigger number on the 0-to-1 scale than a change on Feature 4 from 91 to 89. Both changes have the same absolute difference of 2, but because Feature 2 spans a narrower range, the same change is more significant for Feature 2 than for Feature 4.

After this is done, we would create a new dataset representing the rescaled data.

The idea is to later remove features according to the change of a value relative to the range of its column, rather than the magnitude of the change relative to the whole dataset.
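To make the desired mapping concrete, here is a small hand-rolled sketch of it in plain pandas (the column values below are toy numbers chosen to match the illustration, not the actual dataset):

```python
import pandas as pd

# Toy columns standing in for Feature 2 and Feature 4
df = pd.DataFrame({'Feature 2': [10.0, 71.0, 73.0],
                   'Feature 4': [20.0, 89.0, 91.0]})

# Min-max scaling: (x - min) / (max - min) maps each column to [0, 1]
scaled = (df - df.min()) / (df.max() - df.min())

# The same absolute change of 2 is bigger on the 0-to-1 scale for
# Feature 2, because its column range (73 - 10 = 63) is narrower
# than Feature 4's (91 - 20 = 71)
delta2 = scaled.loc[2, 'Feature 2'] - scaled.loc[1, 'Feature 2']  # 2/63
delta4 = scaled.loc[2, 'Feature 4'] - scaled.loc[1, 'Feature 4']  # 2/71
```

This per-column rescaling is exactly what is usually called min-max normalization.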

I hope this was not confusing.

Upvotes: 1

Views: 921

Answers (3)

CharuNethra Giri

Reputation: 36

You can use the log transformation / normalization technique below from the package ctrl4ai.

It will apply a log transformation to the features that are skewed/asymmetric.

pip install ctrl4ai

from ctrl4ai import preprocessing
preprocessing.log_transform(dataset)

Usage: [arg1]:[pandas dataframe], [method]=['yeojohnson'/'added_constant']
Description: Checks if a continuous column is skewed and applies a log transformation
Returns: Dataframe [with all skewed columns normalized using the appropriate approach]

Upvotes: 0

SergiyKolesnikov

Reputation: 7815

I suppose you are looking for the MinMaxScaler from the sklearn.preprocessing module.

The sklearn.preprocessing module includes scaling, centering, normalization, binarization and imputation methods.

If you want to rescale the original data "inplace" (i.e., replace the original values with the rescaled ones) then you can do it like this:

from sklearn.preprocessing import MinMaxScaler

# copy=False makes fit_transform rescale the underlying array in place
scaler = MinMaxScaler(copy=False)
scaler.fit_transform(df['Feature 2'].values.reshape(-1, 1))
scaler.fit_transform(df['Feature 4'].values.reshape(-1, 1))

df[['Feature 2', 'Feature 4']].describe()

Output:

        Feature 2   Feature 4
count  300.000000  300.000000
mean     0.563870    0.475371
std      0.189137    0.179086
min      0.000000    0.000000
25%      0.439482    0.344611
50%      0.566084    0.471282
75%      0.695583    0.593683
max      1.000000    1.000000

Upvotes: 2

MaxU - stand with Ukraine

Reputation: 210982

Demo:

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()

# Rescale every 'Feature *' column to [0, 1] in one call
df.loc[:, df.columns.str.contains('Feature')] = mms.fit_transform(df.filter(like='Feature'))

yields:

In [164]: df
Out[164]:
     Class  Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
0        0   0.416385   0.666981   0.510885   0.530803   0.676278   0.443090
1        0   0.556001   0.473475   0.401624   0.272491   0.376577   0.699309
2        0   0.510970   0.617226   0.603038   0.449458   0.703408   0.388056
3        1   0.674764   0.590244   0.639278   0.203411   0.594984   0.289978
4        0   0.284630   0.707643   0.357078   0.653500   0.641764   0.484258
5        0   0.487175   0.566235   0.469849   0.414133   0.550115   0.550655
6        1   0.425064   0.354257   0.452126   0.625156   0.673901   0.641468
7        0   0.412525   0.617383   0.446962   0.536107   0.651904   0.414641
8        0   0.509887   0.382452   0.511992   0.556738   0.768706   0.291556
9        0   0.580941   0.452781   0.534328   0.326482   0.518002   0.641739
..     ...        ...        ...        ...        ...        ...        ...
290      0   0.728144   0.151289   0.692940   0.409269   0.834617   0.214392
291      1   0.377372   0.169778   0.405410   0.776607   0.736210   0.732727
292      0   0.519530   0.360764   0.503794   0.530192   0.723015   0.374990
293      0   0.629286   0.444416   0.462688   0.194132   0.374052   0.675573
294      1   0.660195   0.675694   0.675262   0.185723   0.575563   0.364423
295      1   0.322941   0.489876   0.474006   0.746047   0.754077   0.643757
296      0   0.460637   0.500117   0.236784   0.305325   0.240014   0.862539
297      1   0.521527   0.326676   0.430562   0.455950   0.557530   0.616107
298      0   1.000000   0.000000   1.000000   0.213472   0.979327   0.012098
299      1   0.688809   0.602628   0.654906   0.184625   0.599433   0.262852

[300 rows x 7 columns]

after scaling:

In [166]: df.describe()
Out[166]:
            Class   Feature 1   Feature 2   Feature 3   Feature 4   Feature 5   Feature 6
count  300.000000  300.000000  300.000000  300.000000  300.000000  300.000000  300.000000
mean     0.500000    0.493667    0.563870    0.560114    0.475371    0.679344    0.451538
std      0.500835    0.141253    0.189137    0.162298    0.179086    0.156490    0.176866
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.000000    0.408848    0.439482    0.446708    0.344611    0.599389    0.317857
50%      0.500000    0.495316    0.566084    0.557805    0.471282    0.704260    0.457312
75%      1.000000    0.581756    0.695583    0.683460    0.593683    0.785408    0.571726
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000

Upvotes: 0
