Reputation: 9345
I'm having some trouble normalizing my data in Pandas. I've created a model and am trying to use it to predict.
First, I have this:
_text_img_count _text_vid_count _text_link_count _text_par_count ...
0 2 0 6
Then I normalize as follows:
x = numeric_df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
numeric_df_normalized = pd.DataFrame(x_scaled)
Now, numeric_df_normalized looks like this:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
17 18 19 20 21 22
0 0.0 0.0 0.0 0.0 0.0 0.0
So I've lost my column names and my values are all 0.
Finally, I try to add back the old column names from the original numeric_df as follows:
numeric_df_normalized = pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns)
I get back:
_text_img_count _text_vid_count _text_link_count ...
NaN NaN NaN
So a few questions:
1) Why does normalization cause me to lose my column names and set my values to 0?
2) Why does adding back the column names from numeric_df cause my 0s to be converted to NaNs?
Thanks!
Upvotes: 1
Views: 2411
Reputation: 3621
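1) Why does normalization cause me to lose my column names and set the values to 0?
The column names are lost because numeric_df.values returns a plain NumPy array, which carries no labels, so pd.DataFrame(x_scaled) falls back to default integer column labels.
As for the zeros, MinMaxScaler computes the following, per the scikit-learn documentation:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Here min, max = feature_range, which defaults to (0, 1). When a column's data min == data max, as happens for every column if the data has only one row, the numerator is 0 and scikit-learn treats the zero range as 1 to avoid dividing by zero, so the scaled result is 0.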
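A minimal sketch of the zero-range case, assuming the data being scaled has a single row, as the output in the question suggests. The arrays hold made-up values and are not the asker's data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-row input: each column's min equals its max.
single_row = np.array([[0.0, 2.0, 0.0, 6.0]])
scaled_single = MinMaxScaler().fit_transform(single_row)

# Hypothetical two-row input: each column now has a non-zero range.
two_rows = np.array([[0.0, 2.0, 0.0, 6.0],
                     [4.0, 3.0, 1.0, 8.0]])
scaled_two = MinMaxScaler().fit_transform(two_rows)

print(scaled_single)  # every entry is 0
print(scaled_two)     # first row maps to 0, second row to 1
```

If the scaler is meant to prepare a single row for prediction, the zeros come from fitting on that one row rather than from the data itself.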
2) Why does adding back the column names from numeric_df cause my 0s to be converted to NaNs?
Note that numeric_df_normalized is already a dataframe, so pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns) does not rename anything. It tries to select the requested columns from the existing dataframe by label. The existing labels are the integers 0 to 22, none of which match the original column names, so the resulting data is all NaN.
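The label mismatch can be reproduced with a small sketch. The frames below are hypothetical stand-ins that reuse two column names from the question:

```python
import pandas as pd

# Stand-in for the original frame, with named columns.
original = pd.DataFrame({"_text_img_count": [0, 1], "_text_vid_count": [2, 3]})

# Stand-in for the scaled frame: built from bare values, so its
# column labels are the default integers 0 and 1.
scaled = pd.DataFrame([[0.0, 0.0], [1.0, 1.0]])

# Passing a DataFrame together with columns= looks the labels up in
# the existing frame; none of them exist there, so every cell is NaN.
reindexed = pd.DataFrame(scaled, columns=original.columns)

print(list(scaled.columns))  # [0, 1]
print(reindexed)
```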
Upvotes: 3
Reputation: 19634
If you want to transform the result to a dataframe with the same structure, you can do:
numeric_df_normalized.columns=numeric_df.columns
numeric_df_normalized.index=numeric_df.index
(the second line is in case you had an index as well) instead of
numeric_df_normalized = pd.DataFrame(numeric_df_normalized, columns=numeric_df.columns)
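An alternative sketch, not part of the original answer: the labels can also be supplied when the dataframe is first built from the scaler's array output, which assigns them by position. The sample frame and its index values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical input with named columns and a non-default index.
numeric_df = pd.DataFrame(
    {"_text_img_count": [0, 4], "_text_par_count": [6, 8]},
    index=["post_a", "post_b"],
)

# fit_transform returns a NumPy array by default, so labels are lost here.
x_scaled = MinMaxScaler().fit_transform(numeric_df.values)

# Building from the array (not from a DataFrame) attaches the labels
# positionally instead of looking them up.
numeric_df_normalized = pd.DataFrame(
    x_scaled, columns=numeric_df.columns, index=numeric_df.index
)
print(numeric_df_normalized)
```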
Regarding the 0's, this can happen if each value in the first row is the smallest value in its column. Min-max scaling maps each column's minimum to 0, so the whole row becomes 0.
For example, consider the following normalization:
import pandas as pd
from sklearn import preprocessing
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(df.values)
x_scaled is
array([[ 0., 0.],
[ 1., 1.]])
So the upper left 1 became 0 (since 1<2) and the upper right 3 became 0 (since 3<4).
Upvotes: 1