Reputation: 35
I'm trying to build a regression model to predict housing sales, and I'm having trouble processing the train data and the test data (a separate test set, not a validation split taken from the training set) in the same way. The processing steps I'm performing are as follows:
Say my train data has the following columns (after label extraction) (the ones in ** ** contain null values):
['col1', 'col2', '**col3**', 'col4', '**col5**', 'col6', '**col7**','**col8**', '**col9**', '**col10**', 'col11']
test data has the following columns:
['col1', '**col2**', 'col3', 'col4', 'col5', 'col6', '**col7**', '**col8**', '**col9**', '**col10**', 'col11']
I drop only the columns with more than 50% null values and impute the rest of the columns marked with ** **. Say, in the train data, I will have:
cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col8**', '**col9**','**col10**' ]
And if I retain the same columns to be dropped from test data too, my test data will have the following:
cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col2**', '**col8**', '**col9**','**col10**' ]
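A minimal sketch of how two such lists could be derived from a dataframe's null fractions, assuming pandas is in use. The helper name split_null_columns and the toy column values are hypothetical and only illustrate the 50% rule described above:

```python
import numpy as np
import pandas as pd

def split_null_columns(df, threshold=0.5):
    """Return (cols_to_drop, cols_to_impute) based on each column's null fraction."""
    null_frac = df.isna().mean()  # fraction of nulls per column
    cols_to_drop = null_frac[null_frac > threshold].index.tolist()
    cols_to_impute = null_frac[(null_frac > 0) & (null_frac <= threshold)].index.tolist()
    return cols_to_drop, cols_to_impute

# Toy train data (hypothetical values).
train = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],          # no nulls
    "col3": [np.nan, np.nan, np.nan, 1.0],  # 75% null, so it is dropped
    "col8": [1.0, np.nan, 3.0, 4.0],        # 25% null, so it is imputed
})
cols_to_drop, cols_to_impute = split_null_columns(train)
```

Running the same helper on the test data alone would give different lists whenever the null patterns differ, which is the mismatch described next.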
The problem now comes with imputation, where I have to .fit_transform my imputer with the cols_to_impute in the train data and have to .transform the same imputer with the cols_to_impute in the test data, since there is a clear difference in the number of features supplied in the two cols_to_impute lists. (I did this as well and had issues with imputation.)
Say, if I keep the same cols_to_impute in both the train and test datasets, ignoring the null column **col2** of the test data, I face an issue at the one-hot encoding step, which says NaNs need to be handled before encoding. So, how should the processing be done for the train and test sets in such cases? Should I concatenate both of them, perform the processing, and split them again later? I have read about leakage issues in doing this.
Upvotes: 0
Views: 1096
Reputation: 21749
Well, you should do the following:
1. Combine the train and test dataframes, then do the first two steps, i.e. dropping the columns with nulls and imputing the rest.
2. Split them back into train and test, then do one-hot encoding.
This would ensure that both data frames have the same columns and that there is no leakage in doing the one-hot encoding.
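The two steps above could be sketched roughly as follows. This is only an illustration under assumptions: it uses plain pandas (median for numeric columns, most frequent value otherwise) in place of whatever imputer is actually in use, and the function name preprocess_together and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def preprocess_together(train, test, threshold=0.5):
    # Step 1: stack both frames, keeping track of which rows came from where.
    combined = pd.concat([train, test], keys=["train", "test"])

    # Drop columns whose null fraction exceeds the threshold.
    null_frac = combined.isna().mean()
    combined = combined.drop(columns=null_frac[null_frac > threshold].index)

    # Impute remaining nulls: median for numeric columns, most frequent value otherwise.
    for col in combined.columns[combined.isna().any()]:
        if pd.api.types.is_numeric_dtype(combined[col]):
            fill = combined[col].median()
        else:
            fill = combined[col].mode().iloc[0]
        combined[col] = combined[col].fillna(fill)

    # Step 2: split back into train and test, then one-hot encode each part.
    train_out = pd.get_dummies(combined.loc["train"], dtype=int)
    test_out = pd.get_dummies(combined.loc["test"], dtype=int)

    # Align test to the train columns: categories unseen in train are dropped,
    # categories absent from test become all-zero columns.
    test_out = test_out.reindex(columns=train_out.columns, fill_value=0)
    return train_out, test_out

# Toy data (hypothetical values).
train = pd.DataFrame({"num": [1.0, np.nan, 3.0],
                      "cat": ["a", "b", "a"],
                      "mostly_null": [np.nan, np.nan, 1.0]})
test = pd.DataFrame({"num": [2.0, 4.0],
                     "cat": [None, "c"],
                     "mostly_null": [np.nan, np.nan]})
train_out, test_out = preprocess_together(train, test)
```

Because the drop and impute decisions are made once on the combined frame, both outputs end up with the same columns and no NaNs reach the encoding step. If leakage from the imputation itself is a concern, the fill values can instead be computed from the train rows only and then applied to both parts.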
Upvotes: 0