anon_swe

Reputation: 9335

Scikit-Learn Pipeline: How to Handle Preprocessing

I'm doing some machine learning practice on Kaggle and I'm beginning to use the sklearn.pipeline.Pipeline class to transform my data several times and then train a model on it.

I want to encapsulate several parts of pre-processing my data: dropping rows with 30% or more NaNs, dropping columns with 30% or more NaNs, amongst other things.

Here's the start of my attempt at a custom Transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class NanHandler(BaseEstimator, TransformerMixin):
    def __init__(self, target_col, row_threshold=0.7, col_threshold=0.7):
        self.target_col = target_col
        self.row_threshold = row_threshold
        self.col_threshold = col_threshold

    def transform(self, X):
        # X is assumed to be a pandas DataFrame; keep rows whose non-NaN
        # fraction exceeds row_threshold (i.e. drop rows with >= 30% NaNs)
        X = X.loc[X.notna().mean(axis=1) > self.row_threshold]
        # then apply the same rule column-wise
        return X.loc[:, X.notna().mean(axis=0) > self.col_threshold]

    def fit(self, *_):
        return self
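
To sanity-check the behaviour, I'd call it on its own like this (the toy frame below is made up purely for illustration):

import numpy as np
import pandas as pd

# hypothetical toy data: colA is mostly NaN, and row 2 is mostly NaN
toy = pd.DataFrame({
    "colA": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "colB": [1.0,    2.0,    np.nan, 4.0, 5.0],
    "colC": [1.0,    2.0,    np.nan, 4.0, 5.0],
    "colD": [1.0,    2.0,    3.0,    4.0, 5.0],
})

handler = NanHandler(target_col="colD")
print(handler.fit_transform(toy))
# row 2 (75% NaN) is dropped first, then colA (50% NaN in the remaining rows)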

However, I want to use this Transformer with k-fold cross-validation. I'm concerned that if I do 3-fold cross-validation, I could (even if it's unlikely) run into the following situation:

Train on folds 1 and 2, test on 3

Train on folds 2 and 3, test on 1

Train on folds 1 and 3, test on 2

Folds 1 and 2 combined may have over 30% NaNs in a specific column (call it colA), so my NanHandler will drop this column before training. However, folds 2 and 3 combined may have less than 30% NaNs and so it won't drop colA, resulting in my model being trained on different columns than on the first pass.
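
To make the concern concrete, here's a small made-up example (the 60-row frame and the NaN placement are invented just to show how much the NaN fraction can vary between training splits):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# hypothetical frame where colA's NaNs are all concentrated in the first fold
df = pd.DataFrame({"colA": [np.nan] * 20 + [1.0] * 40,
                   "colB": range(60)})

for train_idx, test_idx in KFold(n_splits=3).split(df):
    frac = df.iloc[train_idx]["colA"].isna().mean()
    print(f"NaN fraction of colA in this training split: {frac:.2f}")
# prints 0.00, 0.50 and 0.50, so NanHandler would keep colA in the first pass
# and drop it in the other two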

1) How should I handle this situation?

2) Is this also a problem if I want to drop rows that have 30% or more NaN values (in that I'll train on a different number of rows during k-fold cross-validation)?

Thanks!

Upvotes: 0

Views: 1352

Answers (1)

sinapan

Reputation: 1000

The 30% figure is a little ambiguous to me: 30% of your entire dataset, or 30% within each fold? For example, if you have a dataset with 90 samples and you split it into 3 folds of 30, would you want 70% of the columns and rows within a fold of 30 points to be present (non-NaN)? (I'm going to assume that this is the case.)

Then perhaps the following could work:

  1. Clear your entire dataset of all features and samples that have any missing values (NaN), and create a pool of the data points that have at least one NaN.
  2. Then build your folds from the cleaned data.
  3. Now, based on your number of features and examples, you can resample points from your NaN pool and add them back to each of your folds (see the sketch below).
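
Here's a rough sketch of what I mean, using a made-up 90-sample frame; for simplicity it pools whole samples (rows), and the 10%-of-fold-size resample in step 3 is just an arbitrary figure you'd tune to your own feature/example counts:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# toy stand-in for your full training set
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(90, 4)), columns=list("ABCD"))
df.loc[rng.choice(90, 20, replace=False), "A"] = np.nan   # sprinkle NaNs into one column

# 1. split into a clean part and a pool of points with at least one NaN
clean = df.dropna()
nan_pool = df[df.isna().any(axis=1)]

# 2. build the folds from the clean part
folds = []
for _, fold_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(clean):
    fold = clean.iloc[fold_idx]
    # 3. resample from the NaN pool and add those points back to the fold
    extra = nan_pool.sample(n=len(fold) // 10, replace=True, random_state=0)
    folds.append(pd.concat([fold, extra]))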

I hope this helps.

Upvotes: 2
