anon_swe

Reputation: 9335

Scikit-Learn Pipeline: How to Handle Preprocessing

I'm doing some machine learning practice on Kaggle and I'm beginning to use the sklearn.pipeline.Pipeline class to transform my data several times and then train a model on it.

I want to encapsulate several parts of pre-processing my data: dropping rows with 30% or more NaNs, dropping columns with 30% or more NaNs, amongst other things.

Here's the start of my attempt at a custom Transformer:

from sklearn.base import BaseEstimator, TransformerMixin

class NanHandler(BaseEstimator, TransformerMixin):
    def __init__(self, target_col, row_threshold=0.7, col_threshold=0.7):
        self.target_col = target_col
        self.row_threshold = row_threshold
        self.col_threshold = col_threshold

    def transform(self, X):
        # X is assumed to be a pandas DataFrame; keep rows whose non-NaN
        # fraction exceeds row_threshold (i.e. drop rows with >= 30% NaNs)
        X = X.loc[X.notna().mean(axis=1) > self.row_threshold]
        # then apply the same rule column-wise
        return X.loc[:, X.notna().mean(axis=0) > self.col_threshold]

    def fit(self, *_):
        return self
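
To sanity-check the behaviour, I'd call it on its own like this (the toy frame below is made up purely for illustration):

import numpy as np
import pandas as pd

# hypothetical toy data: colA is mostly NaN, and row 2 is mostly NaN
toy = pd.DataFrame({
    "colA": [np.nan, np.nan, np.nan, 4.0, 5.0],
    "colB": [1.0,    2.0,    np.nan, 4.0, 5.0],
    "colC": [1.0,    2.0,    np.nan, 4.0, 5.0],
    "colD": [1.0,    2.0,    3.0,    4.0, 5.0],
})

handler = NanHandler(target_col="colD")
print(handler.fit_transform(toy))
# row 2 (75% NaN) is dropped first, then colA (50% NaN in the remaining rows)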

However, I want to use this Transformer with k-fold cross-validation. I'm concerned that if I do 3-fold cross-validation, I could (even if it's unlikely) run into the following situation:

Train on folds 1 and 2, test on 3

Train on folds 2 and 3, test on 1

Train on folds 1 and 3, test on 2

Folds 1 and 2 combined may have over 30% NaNs in a specific column (call it colA), so my NanHandler will drop this column before training. However, folds 2 and 3 combined may have less than 30% NaNs and so it won't drop colA, resulting in my model being trained on different columns than on the first pass.
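
To make the concern concrete, here's a small made-up example (the 60-row frame and the NaN placement are invented just to show how much the NaN fraction can vary between training splits):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# hypothetical frame where colA's NaNs are all concentrated in the first fold
df = pd.DataFrame({"colA": [np.nan] * 20 + [1.0] * 40,
                   "colB": range(60)})

for train_idx, test_idx in KFold(n_splits=3).split(df):
    frac = df.iloc[train_idx]["colA"].isna().mean()
    print(f"NaN fraction of colA in this training split: {frac:.2f}")
# prints 0.00, 0.50 and 0.50, so NanHandler would keep colA in the first pass
# and drop it in the other two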

1) How should I handle this situation?

2) Is this also a problem if I want to drop rows that have 30% or more NaN values (in that I'll train on a different number of rows during k-fold cross-validation)?

Thanks!

Upvotes: 0

Views: 1352

Answers (1)

sinapan

Reputation: 1000

The 30% figure is a little ambiguous to me: 30% of your entire dataset, or 30% within each fold? For example, if you have a dataset with 90 samples and you split it into 3 folds of 30, would you want 70% of the columns and rows within a fold of 30 points to be present (non-NaN)? (I'm going to assume that this is the case.)

Then perhaps the following could work:

  1. Clear your entire dataset of all features and samples that have any missing values (NaN), and create a pool of the data points that have at least one NaN.
  2. Then build your folds from the cleaned data.
  3. Now, based on your number of features and examples, you can resample points from your NaN pool and add them back to each of your folds (see the sketch below).
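
Here's a rough sketch of what I mean, using a made-up 90-sample frame; for simplicity it pools whole samples (rows), and the 10%-of-fold-size resample in step 3 is just an arbitrary figure you'd tune to your own feature/example counts:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# toy stand-in for your full training set
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(90, 4)), columns=list("ABCD"))
df.loc[rng.choice(90, 20, replace=False), "A"] = np.nan   # sprinkle NaNs into one column

# 1. split into a clean part and a pool of points with at least one NaN
clean = df.dropna()
nan_pool = df[df.isna().any(axis=1)]

# 2. build the folds from the clean part
folds = []
for _, fold_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(clean):
    fold = clean.iloc[fold_idx]
    # 3. resample from the NaN pool and add those points back to the fold
    extra = nan_pool.sample(n=len(fold) // 10, replace=True, random_state=0)
    folds.append(pd.concat([fold, extra]))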

I hope this helps.

Upvotes: 2
