Luigi87

Reputation: 275

Scaling the dataset for train and test set (StandardScaler, Binarizer) with fit_transform and transform

I would like feedback on whether I am doing this correctly. I have a binary classification problem, so the first step is to scale the data.

I use StandardScaler with my features (all numerical continuous values) and Binarizer with my target variable (binary value).

My dataframe df is as below:

Date        Regime     Label  feat1  feat2  feat3   feat4
1960-09-01  Recession  1.0    -0.1   120    5555.2   0.006
1960-10-01  Recession  1.0     0.6   140    6585.9  -0.001
1960-11-01  Recession  0.0     0.0   90     4567    -0.002
...

Now I split the data into train and test sets and scale them differently: for the training set I use fit_transform (for both the training features and the training target), and for the test/validation set I use transform (for both the validation features and the validation target).

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

df_idx = df[df.Date == '1996-12-01'].index[0]

df_targets = df['Label'].values
df_features = df.drop(['Regime','Date','Label'], axis=1)

#scaling training features
df_training_features_ = df_features.iloc[:df_idx,:]
scaler=StandardScaler()
df_training_features = scaler.fit_transform(df_training_features_)

#scaling validation features
df_validation_features_ = df_features.iloc[df_idx:, :]
df_validation_features = scaler.transform(df_validation_features_)

#scaling training target
df_training_targets_ = df_targets[:df_idx]
lb = preprocessing.Binarizer(threshold = 0.5)
df_training_targets = lb.fit_transform(df_training_targets_.reshape(1, -1))[0]

#scaling validation target
df_validation_targets_ = df_targets[df_idx:]
df_validation_targets = lb.transform(df_validation_targets_.reshape(1, -1))[0]

After this I will start on hyperparameter tuning, feature selection and model construction, but I am struggling to tell whether this is right or wrong.

Could you please confirm whether this is correct?

Upvotes: 0

Views: 288

Answers (1)

Alex Serra Marrugat

Reputation: 2042

Be careful!

In your question you say that you will use transform for the validation targets, but in your code you are using fit_transform for them. You should use transform, as you said:

df_validation_targets = lb.transform(df_validation_targets_.reshape(1, -1))[0]
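
To make the fit_transform / transform distinction concrete, here is a minimal sketch with StandardScaler. The toy arrays are hypothetical stand-ins for the question's feature columns, not the real data. The point is that the scaling statistics come from the training rows only and are then reused on the validation rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the question's feature columns.
train_features = np.array([[-0.1, 120.0], [0.6, 140.0], [0.0, 90.0]])
validation_features = np.array([[0.3, 100.0], [-0.2, 130.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
train_scaled = scaler.fit_transform(train_features)
# transform reuses those training statistics; nothing is re-estimated here.
validation_scaled = scaler.transform(validation_features)
```

Calling fit_transform on the validation rows instead would recompute the mean and standard deviation from them, so the two sets would no longer be on the same scale.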

By the way, I assume you also have test data? If not, tuning hyperparameters on your validation set and then reporting accuracy on that same set is not a good approach. You should tune the hyperparameters using the validation data and, finally, test your model on test data that the model has never seen.
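
As a rough sketch of that three-way split, assuming the rows are already sorted by date: the helper name chronological_split and the cut positions below are hypothetical, chosen only for illustration.

```python
import pandas as pd

def chronological_split(df, train_end, validation_end):
    """Split a date-sorted frame into train, validation and test parts by row position."""
    train = df.iloc[:train_end]
    validation = df.iloc[train_end:validation_end]
    test = df.iloc[validation_end:]
    return train, validation, test

# Hypothetical monthly data; the cut positions are illustrative only.
dates = pd.date_range("1960-09-01", periods=10, freq="MS")
df = pd.DataFrame({"Date": dates, "Label": [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]})
train, validation, test = chronological_split(df, train_end=6, validation_end=8)
```

Any scaler would then be fitted on train only, the hyperparameters chosen on validation, and test touched once at the very end.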

P.S.: I don't know whether fit_transform works with only one sample (try it).
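
One way to sidestep the single-sample question is to reshape the targets into one column (many samples, one feature) instead of one row. This is a sketch on hypothetical toy labels. As far as I know, Binarizer only applies the threshold and learns nothing from the data, so fit_transform and transform should give the same values. With reshape(1, -1), though, the whole target vector counts as one sample with N features, and some scikit-learn versions may reject a validation vector of a different length.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy labels; the two sets deliberately have different lengths.
train_targets = np.array([1.0, 1.0, 0.0, 1.0])
validation_targets = np.array([0.0, 1.0])

lb = Binarizer(threshold=0.5)
# reshape(-1, 1): each label is one sample with a single feature.
train_binary = lb.fit_transform(train_targets.reshape(-1, 1)).ravel()
validation_binary = lb.transform(validation_targets.reshape(-1, 1)).ravel()
```

Since the labels in the question are already 0.0 and 1.0, the thresholding leaves them unchanged.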

Upvotes: 0
