How to use GridSearchCV to compare imputer methods?

Question

I'm preprocessing the Titanic data set so that I can run it through some regressions. The "Age" column is populated for only around 80% of the rows in each of the train and test sets.

Rather than just dropping the rows that don't have an "Age", I'd like to use SimpleImputer (from sklearn.impute import SimpleImputer) to fill in the missing values in that column.

SimpleImputer has three options for the 'strategy' parameter that work with numeric data: 'mean', 'median', and 'most_frequent' (the mode). (There's also the option to fill in a custom value with strategy='constant', but because I'm trying to avoid "binning" the values I don't want to use this option.)
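To make the difference between the three strategies concrete, here is a minimal sketch on a made-up column. The values are invented for the example and are not taken from the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy "Age" column with one missing entry (invented values, not Titanic data).
ages = np.array([[22.0], [22.0], [30.0], [54.0], [np.nan]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # The last row is the missing one; record what each strategy puts there.
    filled[strategy] = imputer.fit_transform(ages)[-1, 0]
    print(strategy, filled[strategy])
```

With these four known values the gap is filled with 32.0 (mean), 26.0 (median) and 22.0 (most frequent).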

At its most basic, my approach would involve setting up the required datasets manually. I'd have to run each kind of imputer (imputer = SimpleImputer(strategy="xxxxxx"), where xxxxxx is 'mean', 'median', or 'most_frequent') on both the train and test datasets, ending up with six different datasets that I'd then have to feed through my RandomForestRegressor one at a time.

I know that GridSearchCV can be used to exhaustively compare combinations of parameter values for a regressor, so I'm wondering if anyone knows a way to use it, or something similar, to run through the various 'strategy' options of the imputer?

I'm thinking of something along the lines of the following pseudocode:

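from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Pseudocode only: 'strategy' is a SimpleImputer parameter, not a
# RandomForestRegressor one, so this grid does not run as written.
param_grid = [
    {'strategy': ['mean', 'median', 'most_frequent']},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

# Intended meaning: one imputed dataset per strategy.
grid_search.fit(titanic_features[strategy], titanic_values[strategy])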

Is there a clean way to compare options like this?

Is there a better way to compare the three options than to build all six data sets, run them through the RF regressor and see what comes out?


Answers (1)

Sample code (documented inline)
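
One way this could look is sketched below. It is not a definitive implementation. It assumes the imputer and the regressor are chained in a scikit-learn Pipeline, so that GridSearchCV can reach the imputer's strategy through the "<step name>__<parameter>" naming convention. The step names "imputer" and "forest" and the synthetic data are assumptions made for the example, since the real Titanic frame is not shown in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Titanic features (assumption: not the real data).
rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=10.0, size=(200, 3))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=200)

# Blank out roughly 20% of the first column to mimic the missing "Age" values.
missing = rng.random(200) < 0.2
X[missing, 0] = np.nan

# With the imputer inside the pipeline, it is re-fitted on the training part
# of every CV fold, so no statistics leak in from the held-out rows.
pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# "<step name>__<parameter>" routes a grid entry to one pipeline step.
param_grid = {"imputer__strategy": ["mean", "median", "most_frequent"]}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# Best strategy, then the mean CV score for each candidate.
print(grid_search.best_params_)
for strategy, score in zip(grid_search.cv_results_["param_imputer__strategy"],
                           grid_search.cv_results_["mean_test_score"]):
    print(strategy, score)
```

Regressor hyperparameters can be searched in the same grid by adding keys such as "forest__n_estimators". If only some columns need imputing, the imputer can be wrapped in a ColumnTransformer, in which case the grid key gains one more level, for example "preprocess__age__strategy" for hypothetical step names "preprocess" and "age".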
