Reputation: 1523
I'm not sure if I'm doing something wrong, or if this is not the correct way to do this.
I'm encoding variables in a dataset for a model. I'm using a Normalizer() from sklearn.preprocessing to normalize one of my variables, which is numerical.
My dataset is split in two, one part for training and one for inference. My goal is to normalize this numerical variable (let's call it column x) in the training subset, and then use the same normalization parameters to normalize that variable in the inference dataset. The two subsets don't have the same number of entries, so what I'm doing is:
nr = Normalizer()
nr.fit([df1.x])
new_col = nr.transform(df1.x)
The problem is that when I try to use the same normalizer parameters on column x in the inference subset, which has a different number of rows:
new_col1 = nr.transform(df2.x)
I get:
X has 10 features, but Normalizer is expecting 697 features as input.
I'm not sure if it's some reshape problem or if the Normalizer() shouldn't be used in that way, so, any advice would be more than welcome.
Upvotes: 1
Views: 578
Reputation: 5324
Normalizer is used to normalize rows, whereas StandardScaler is used to normalize columns. Judging by your question, you want to scale a column, so you should use StandardScaler.
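To make the row-versus-column distinction concrete, here is a small sketch on a made-up 3x2 array (the values are purely illustrative, not from the question). With its default L2 norm, Normalizer rescales each row to unit length, while StandardScaler shifts and scales each column to zero mean and unit variance. This is presumably also why the original fit([df1.x]) call ended up "expecting 697 features": wrapping the Series in a list produces a single row with 697 columns.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical toy data: 3 samples (rows), 2 features (columns).
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [0.0, 5.0]])

# Normalizer works per row: each sample is divided by its own L2 norm.
row_normalized = Normalizer().fit_transform(X)

# StandardScaler works per column: each feature is centered and scaled.
col_scaled = StandardScaler().fit_transform(X)

row_norms = np.linalg.norm(row_normalized, axis=1)  # length of each row
col_means = col_scaled.mean(axis=0)                 # mean of each column
col_stds = col_scaled.std(axis=0)                   # std of each column
```

Note that Normalizer has no learned parameters to carry over from training to inference, since every row is normalized independently of the others.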
scikit-learn transformers expect a 2D array of shape (n_samples, n_features) as input, but a pandas.Series is a one-dimensional ndarray with axis labels.
You can fix that by passing a pandas.DataFrame to the transformer, as follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
df1 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=1000)})
df2 = pd.DataFrame({'x' : np.random.uniform(low=0, high=10, size=850)})
scaler = StandardScaler()
new_col = scaler.fit_transform(df1[['x']])
new_col1 = scaler.transform(df2[['x']])
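If you would rather start from the Series itself, an alternative sketch is to reshape its values into a single column before handing them to the scaler. The frame names train and infer and the sizes 697 and 10 are assumptions chosen to mirror the numbers in the error message; the data is random. The last line recomputes the inference result by hand to show that the mean and scale learned on the training column are the ones being applied.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the row counts from the error message.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x": rng.uniform(0, 10, size=697)})
infer = pd.DataFrame({"x": rng.uniform(0, 10, size=10)})

scaler = StandardScaler()

# reshape(-1, 1) turns the 1D values into shape (n_samples, 1).
train_scaled = scaler.fit_transform(train["x"].to_numpy().reshape(-1, 1))
infer_scaled = scaler.transform(infer["x"].to_numpy().reshape(-1, 1))

# The inference column is scaled with the statistics fitted on training data.
manual = (infer["x"].to_numpy() - scaler.mean_[0]) / scaler.scale_[0]
```

Because the scaler sees one feature in both calls, the differing number of rows between the two subsets no longer matters.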
Upvotes: 1