Reputation: 9335
I'm doing some machine learning practice on Kaggle and I'm beginning to use the sklearn.pipeline.Pipeline class to transform my data several times and then train a model on it.
I want to encapsulate several parts of pre-processing my data: dropping rows with 30% or more NaNs, dropping columns with 30% or more NaNs, amongst other things.
Here's the start of my attempt at a custom Transformer:
from sklearn.base import BaseEstimator, TransformerMixin

class NanHandler(BaseEstimator, TransformerMixin):
    def __init__(self, target_col, row_threshold=0.7, col_threshold=0.7):
        self.target_col = target_col
        self.row_threshold = row_threshold
        self.col_threshold = col_threshold

    def transform(self, X):
        # drop rows and columns with >= 30% NaN values
        raise NotImplementedError  # body not written yet

    def fit(self, *_):
        return self
However, I want to use this Transformer with k-fold cross-validation. I'm concerned that if I do 3-fold cross-validation, it's unlikely (but possible) that I run into the following situation:
Train on folds 1 and 2, test on 3
Train on folds 2 and 3, test on 1
Train on folds 1 and 3, test on 2
Folds 1 and 2 combined may have over 30% NaNs in a specific column (call it colA), so my NanHandler will drop this column before training. However, folds 2 and 3 combined may have less than 30% NaNs, so it won't drop colA, resulting in my model being trained on different columns than in the first pass.
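To make the scenario concrete, here is a small sketch with made-up data. The values, the name max_nan_fraction, and the unshuffled KFold are assumptions chosen only so that the NaNs in colA are concentrated in the first two folds:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Made-up data: 9 rows, so an unshuffled 3-fold split gives rows 0-2, 3-5, 6-8.
# colA has one NaN in the first fold, one in the second, none in the third.
df = pd.DataFrame({
    "colA": [np.nan, 1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "colB": np.arange(9, dtype=float),
})

max_nan_fraction = 0.3  # hypothetical name: drop columns with >= 30% NaNs

kept_per_split = []
for train_idx, test_idx in KFold(n_splits=3).split(df):
    train = df.iloc[train_idx]
    # Fraction of NaNs per column, computed on the training folds only
    nan_fraction = train.isna().mean()
    kept = list(nan_fraction[nan_fraction < max_nan_fraction].index)
    kept_per_split.append(kept)
    print(kept)
```

With this data, colA survives when the training rows include the third fold (1 NaN in 6 rows) but is dropped when training on the first two folds together (2 NaNs in 6 rows).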
1) How should I handle this situation?
2) Is this also a problem if I want to drop rows that have 30% or more NaN values (in that I'll train on a different number of rows during k-fold cross-validation)?
Thanks!
Upvotes: 0
Views: 1352
Reputation: 1000
The figure of 30% is a little ambiguous to me: is it 30% of your entire dataset, or 30% within each fold? For example, if you have a dataset with 90 samples and you break it up into 3 folds of 30, would you want 70% of cols and rows in a fold of 30 points to be present? (I'm going to assume that this is the case.)
Then perhaps the following could work:
... NaN) and create a pool of data points that have at least one NaN.
... NaN and add it back to each of your folds.
I hope this helps.
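An alternative to pooling, and a common scikit-learn convention, is to decide which columns to drop inside fit and only apply that stored decision in transform. The training and test folds of any one split then always end up with the same columns. The columns can still differ from one split to the next, but each split fits a fresh copy of the pipeline, so the model is never asked to predict on columns it was not trained on. A minimal sketch, assuming X is a pandas DataFrame; the class name ColumnNanDropper and the parameter max_nan_fraction are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnNanDropper(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: learn in fit() which columns to keep."""

    def __init__(self, max_nan_fraction=0.3):
        self.max_nan_fraction = max_nan_fraction

    def fit(self, X, y=None):
        # Decide once, on the training data, which columns are kept
        nan_fraction = X.isna().mean()
        self.columns_to_keep_ = list(
            nan_fraction[nan_fraction < self.max_nan_fraction].index
        )
        return self

    def transform(self, X):
        # Reuse the decision made in fit(), whatever the NaN rate of X is
        return X[self.columns_to_keep_]


# Made-up data: colA is 50% NaN in the training part, but complete in the test part
train = pd.DataFrame({"colA": [np.nan, np.nan, 1.0, 2.0],
                      "colB": [1.0, 2.0, 3.0, 4.0]})
test = pd.DataFrame({"colA": [5.0, 6.0], "colB": [7.0, 8.0]})

dropper = ColumnNanDropper().fit(train)
train_out = dropper.transform(train)
test_out = dropper.transform(test)
```

Here colA is removed from the test data as well, even though it has no NaNs there, because the decision was made on the training data.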
Upvotes: 2