Reputation: 9345

Pandas: Getting 0s and NaNs when Normalizing Data

I'm having some trouble normalizing my data in Pandas. I've created a model and am trying to use it to predict.

First, I have this:

_text_img_count  _text_vid_count  _text_link_count  _text_par_count  ...
0                2                0                 6

Then I normalize as follows:

    import pandas as pd
    from sklearn import preprocessing

    x = numeric_df.values  # returns a NumPy array
    min_max_scaler = preprocessing.MinMaxScaler()
    x_scaled = min_max_scaler.fit_transform(x)
    numeric_df_normalized = pd.DataFrame(x_scaled)

Now, numeric_df_normalized looks like this:

 0    1    2    3    4    5    6    7    8    9  ...    13   14   15   16  \
0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 ...   0.0  0.0  0.0  0.0   

    17   18   19   20   21   22  
0  0.0  0.0  0.0  0.0  0.0  0.0

So I've lost my column names and my values are all 0.

Finally, I try to add back the old column names from the original numeric_df as follows:

numeric_df_normalized = pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns)

I get back:

_text_img_count  _text_vid_count  _text_link_count ...
            NaN              NaN               NaN

So a few questions:

1) Why does normalization cause me to lose my column names and set all my values to 0?

2) Why does adding back the column names from numeric_df cause my 0s to be converted to NaNs?

Thanks!

Upvotes: 1

Answers (2)

THN

Reputation: 3621

1) Why does normalization cause me to lose my column names and set all my values to 0?

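MinMaxScaler computes, for each column:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

Here min and max are the bounds of feature_range, which is (0, 1) by default. When a column's data min equals its data max, as happens for every column when the data has only one row, the scaled result is 0 (scikit-learn treats the zero range as 1 to avoid dividing by zero).

The column names are lost because numeric_df.values is a plain NumPy array with no labels, so pd.DataFrame(x_scaled) falls back to the default integer column names 0, 1, 2, ...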
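A minimal sketch of that behaviour, assuming NumPy and scikit-learn are installed. The arrays are made-up values chosen for illustration, not the asker's actual data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-row input, mirroring a one-row frame.
single_row = np.array([[0.0, 2.0, 0.0, 6.0]])

# Each column's min equals its max, so every scaled entry is 0.0.
scaled = MinMaxScaler().fit_transform(single_row)
print(scaled)

# With two distinct rows, every column has a non-zero range and the
# formula above can be checked by hand against the scaler.
two_rows = np.array([[0.0, 2.0, 0.0, 6.0],
                     [4.0, 6.0, 1.0, 8.0]])
col_min = two_rows.min(axis=0)
col_max = two_rows.max(axis=0)
manual = (two_rows - col_min) / (col_max - col_min)
print(np.allclose(manual, MinMaxScaler().fit_transform(two_rows)))
```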

2) Why does adding back the column names from numeric_df cause my 0s to be converted to NaNs?

Note that numeric_df_normalized is already a dataframe, so pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns) does not rename anything: it selects columns from the existing dataframe by label. The existing labels are 0, 1, 2, ..., and none of them match the names in numeric_df.columns, so every requested column comes back filled with NaN.
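The difference between selecting by label and relabelling can be seen in a small sketch. The frame `scaled_df` and the list `names` are hypothetical stand-ins for numeric_df_normalized and numeric_df.columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scaled frame: default integer labels 0, 1, 2.
scaled_df = pd.DataFrame(np.zeros((1, 3)))
names = ["_text_img_count", "_text_vid_count", "_text_link_count"]

# Passing a DataFrame plus columns= selects by label.
# No label matches, so every cell is NaN.
reindexed = pd.DataFrame(scaled_df, columns=names)
print(reindexed.isna().all().all())

# Assigning to .columns relabels the existing columns and keeps the values.
renamed = scaled_df.copy()
renamed.columns = names
print(renamed.iloc[0].tolist())
```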

Upvotes: 3

Miriam Farber

Reputation: 19634

If you want to transform the result to a dataframe with the same structure, you can do:

numeric_df_normalized.columns=numeric_df.columns
numeric_df_normalized.index=numeric_df.index

(the second line matters only if numeric_df has a non-default index) instead of

numeric_df_normalized = pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns)
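An alternative sketch, not taken from the answer itself: the labels can also be attached at the moment the dataframe is built from the raw array, which avoids the relabelling step. The contents of `numeric_df` here are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric_df with a non-default index.
numeric_df = pd.DataFrame(
    {"_text_img_count": [0, 4], "_text_vid_count": [2, 6]},
    index=["page_a", "page_b"],
)

x_scaled = MinMaxScaler().fit_transform(numeric_df.values)

# x_scaled is a plain array, so columns= and index= simply label it.
numeric_df_normalized = pd.DataFrame(
    x_scaled, columns=numeric_df.columns, index=numeric_df.index
)
print(numeric_df_normalized)
```

Because the first argument is an array rather than a dataframe, columns= assigns names instead of selecting by label, so no NaNs appear.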

Regarding the 0s, this can happen if each value in the first row is the smallest value of its feature (column). Min-max scaling maps each column's minimum to 0, so such a row is transformed to all 0s.

For example, consider the following normalization:

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)

x_scaled is

array([[ 0.,  0.],
       [ 1.,  1.]])

So the upper left 1 became 0 (since 1<2) and the upper right 3 became 0 (since 3<4).
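The example above fits the scaler on the same rows it transforms. If the row being normalized is instead a new sample for an already trained model (the question mentions predicting), one common approach is to fit the scaler on the training data and call transform on the new row. The following is a sketch under that assumption; `train_df` and `new_row` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training data and a single new row to score.
train_df = pd.DataFrame({"a": [1, 2, 5], "b": [3, 4, 7]})
new_row = pd.DataFrame({"a": [2], "b": [5]})

# Learn each column's min and max from the training data only.
scaler = MinMaxScaler().fit(train_df.values)

# transform (not fit_transform) reuses the training min and max,
# so a single row is scaled relative to the training range.
new_scaled = pd.DataFrame(
    scaler.transform(new_row.values),
    columns=new_row.columns,
    index=new_row.index,
)
print(new_scaled)
```

Calling fit_transform on the single row would refit the scaler to that row alone and return all 0s.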

Upvotes: 1
