Reputation: 275
I would like feedback on whether I am doing this correctly. I have a binary classification problem, so the first step is to scale the data.
I use StandardScaler on my features (all continuous numerical values) and Binarizer on my target variable (a binary value).
My dataframe df is as below:
Date Regime Label feat1 feat2 feat3 feat4
1960-09-01 Recession 1.0 -0.1 120 5555.2 0.006
1960-10-01 Recession 1.0 0.6 140 6585.9 -0.001
1960-11-01 Recession 0.0 0.0 90 4567 -0.002
...
Now I split the data into training and test sets and scale them differently: for the training set I use fit_transform (for both the training features and the training target), and for the test/validation set I use transform (for both the validation features and the validation target).
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# position of the first validation row
df_idx = df[df.Date == '1996-12-01'].index[0]
df_targets = df['Label'].values
df_features = df.drop(['Regime', 'Date', 'Label'], axis=1)

# scale training features (the scaler is fitted here)
df_training_features_ = df_features.iloc[:df_idx, :]
scaler = StandardScaler()
df_training_features = scaler.fit_transform(df_training_features_)

# scale validation features with the statistics learned on the training set
df_validation_features_ = df_features.iloc[df_idx:, :]
df_validation_features = scaler.transform(df_validation_features_)

# binarize training target
df_training_targets_ = df_targets[:df_idx]
lb = preprocessing.Binarizer(threshold=0.5)
df_training_targets = lb.fit_transform(df_training_targets_.reshape(1, -1))[0]

# binarize validation target
df_validation_targets_ = df_targets[df_idx:]
df_validation_targets = lb.transform(df_validation_targets_.reshape(1, -1))[0]
After this I will start hyperparameter tuning, feature selection and model construction, but I am struggling to tell whether this is right or wrong.
Could you please confirm whether this is correct?
Upvotes: 0
Views: 288
Reputation: 2042
Be careful!
In your question you say that you will use transform for both the validation features and the validation targets. But in your code, for the validation targets, you are using fit_transform. You should use, as you said:
df_validation_targets = lb.transform(df_validation_targets_.reshape(1, -1))[0]
By the way, I suppose you also have test data? If not, tuning hyperparameters on your validation set and then reporting accuracy on that same set is not a good approach. You should tune hyperparameters using the validation data and, at the end, test your model on test data that the model has never seen.
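To make the three-way split concrete, here is a minimal sketch. It assumes a chronological split like the one in the question; the synthetic data frame, the two cut-off dates and the names train_end and valid_end are made up for illustration and are not taken from the original data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df; the values are invented for illustration only.
rng = np.random.default_rng(0)
dates = pd.date_range("1960-09-01", periods=120, freq="MS")
df = pd.DataFrame({
    "Date": dates,
    "Label": rng.integers(0, 2, size=len(dates)).astype(float),
    "feat1": rng.normal(size=len(dates)),
    "feat2": rng.normal(100, 20, size=len(dates)),
})

# Hypothetical cut-off dates; rows stay in time order and are never shuffled.
train_end = pd.Timestamp("1966-12-01")
valid_end = pd.Timestamp("1968-12-01")

features = df.drop(columns=["Date", "Label"])
train_mask = df["Date"] <= train_end
valid_mask = (df["Date"] > train_end) & (df["Date"] <= valid_end)
test_mask = df["Date"] > valid_end

scaler = StandardScaler()
X_train = scaler.fit_transform(features[train_mask])  # statistics are learned here only
X_valid = scaler.transform(features[valid_mask])      # used to compare hyperparameters
X_test = scaler.transform(features[test_mask])        # used once, for the final score
```

The scaler is fitted only on the training rows, so neither the validation nor the test rows influence the mean and standard deviation used for scaling.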
P.S. I don't know whether fit_transform works with only one sample (try it).
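A quick way to try it is the small sketch below, which uses made-up label arrays. It reshapes the targets with reshape(-1, 1), so that each label is one sample with a single feature, instead of the reshape(1, -1) form from the question, which turns the whole array into one sample with many features. As far as I know, recent scikit-learn versions check at transform time that the number of columns matches what was seen during fit, so the column form is the safer assumption when the training and validation sets have different lengths.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up labels, for illustration only.
train_targets = np.array([1.0, 1.0, 0.0, 0.0])
valid_targets = np.array([0.0, 1.0, 1.0])

lb = Binarizer(threshold=0.5)

# Column form: one row per label, a single feature column.
train_bin = lb.fit_transform(train_targets.reshape(-1, 1)).ravel()
valid_bin = lb.transform(valid_targets.reshape(-1, 1)).ravel()

# Binarizer only compares each value with the fixed threshold, so a fresh
# instance fitted on the validation labels produces the same output.
valid_refit = Binarizer(threshold=0.5).fit_transform(valid_targets.reshape(-1, 1)).ravel()
same_result = np.array_equal(valid_bin, valid_refit)
```

Because Binarizer learns nothing from the data, fit_transform and transform give the same labels here; the distinction matters for StandardScaler, which does learn statistics during fit.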
Upvotes: 0