Baktaawar

Reputation: 7490

When to use train_test_split of scikit-learn

I have a dataset with 19 features. I need to do missing value imputation, then encode the categorical variables using OneHotEncoder of scikit, and then run a machine learning algo.

My question is: should I do all of the above on the whole dataset and then split it using the train_test_split method of scikit, or should I first split into train and test and then do missing value imputation and encoding on each set?

My concern is this: if I split first and then do missing value imputation and encoding on the two resulting sets, couldn't the test set be missing some values of a categorical variable, resulting in fewer dummies? Say the original data had 3 levels for a categorical variable. I know we are doing random sampling, but is there a chance that the test set might not have all three levels present for that variable, resulting in only two dummies instead of three?

What's the right approach? Splitting first and then doing all of the above on train and test, or doing missing value imputation and encoding on the whole dataset first and then splitting?

Upvotes: 6

Views: 1921

Answers (1)

Arnaud Joly

Reputation: 904

I would first split the data into a training set and a testing set. Your missing value imputation strategy should be fitted on the training data and then applied to both the training and the testing data.

For instance, if you intend to replace missing values with the most frequent value or the median, this knowledge (median, most frequent value) must be obtained without having seen the testing set. Otherwise, your missing value imputation will be biased. If some values of a feature are unseen in the training data, you can, for instance, increase your overall number of samples or use a missing value imputation strategy that is robust to outliers.
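
As a minimal sketch of this fit-on-train, apply-to-both idea, assuming a recent scikit-learn where the imputer class is SimpleImputer (older releases named it Imputer) and using a made-up one-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: one numeric feature with missing values.
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [np.nan], [3.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

# Split first, so the test rows play no part in computing the median.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows and fills them in.
X_train_imputed = imputer.fit_transform(X_train)

# transform reuses the training median on the test rows; nothing is re-fitted.
X_test_imputed = imputer.transform(X_test)
```

The learned value is stored in imputer.statistics_, which makes it easy to check that it came from the training rows only.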

Here is an example of how to perform missing value imputation using a scikit-learn pipeline and imputer:
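
The sketch below is one way such a pipeline might look, not a definitive implementation. It assumes a recent scikit-learn (SimpleImputer, ColumnTransformer) and pandas. The dataset, the column names age and color, and the choice of LogisticRegression are all hypothetical. Setting handle_unknown="ignore" on the encoder is one option for the dummy-count concern: the dummy columns are fixed by the levels seen in training, so the test set always gets the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric column with gaps, one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan, 52.0, 47.0, 29.0],
    "color": ["red", "green", "blue", "red", "green", "blue", "red", "green"],
})
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Split before any preprocessing is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)

preprocess = ColumnTransformer([
    # Median imputation for the numeric column.
    ("num", SimpleImputer(strategy="median"), ["age"]),
    # One dummy per level seen in training; unseen test levels encode as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# fit learns the median, the category levels, and the classifier from training data only.
model.fit(X_train, y_train)

# predict applies the already-fitted imputer and encoder to the test rows.
predictions = model.predict(X_test)

# The fitted preprocessing step yields the same columns for both splits.
train_features = model.named_steps["preprocess"].transform(X_train)
test_features = model.named_steps["preprocess"].transform(X_test)
```

Because every step lives inside one Pipeline, calling fit on the training split and predict on the test split is enough to keep test data out of the imputation and encoding statistics.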

Upvotes: 3
