THill3

Reputation: 259

How to use GridSearchCV to compare imputer methods?

I'm preprocessing the Titanic data set in order to run it through some regressions. The "Age" column in the train and test sets is populated for only around 80% of the rows in each set.

Rather than just eliminate the rows that don't have an "Age", I'd like to use the SimpleImputer (from sklearn.impute import SimpleImputer) to fill in the missing values in that column.

SimpleImputer has three options for the 'strategy' parameter that work with numeric data: 'mean', 'median', and 'most_frequent' (mode). (There's also the option to fill with a constant value, but because I'm trying to avoid "binning" the values I don't want to use this option.)

At its most basic, my approach would involve manually setting up the required datasets. I'd have to run one of each kind of imputer (imputer = SimpleImputer(strategy="xxxxxx") where xxxxxx = 'mean', 'median', or 'most_frequent') on each of the train and test datasets, ending up with six different datasets that I'd have to feed through my RandomForestRegressor one at a time.

I know that GridSearchCV can be used to exhaustively compare various combinations of parameter values in a regressor, so I'm wondering if anyone knows a way to use it, or something similar, to run through the various 'strategy' options of the imputer?

I'm thinking something along the lines of the following pseudocode:

param_grid = [
    {'strategy': ['mean', 'median', 'most_frequent']},
]

forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')

grid_search.fit(titanic_features[strategy], titanic_values[strategy])

Is there a clean way to compare options like this?

Is there a better way to compare the three options than to build all six data sets, run them through the RF regressor and see what comes out?

Upvotes: 1

Views: 1104

Answers (1)

mujjiga

Reputation: 16856

Sklearn Pipelines are meant for exactly this. Create a pipeline with the imputer component preceding the regressor. You can then use the grid search parameter grid with the <step name>__<parameter name> syntax (note the double underscore) to pass component-specific parameters.
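To see which names the double-underscore syntax accepts, you can ask the pipeline itself. This is a minimal sketch; the step names "impute" and "regressor" are just labels chosen for this illustration, and any names would work as long as the parameter grid uses the same ones.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Step names ("impute", "regressor") are arbitrary labels chosen here.
pipe = Pipeline(steps=[("impute", SimpleImputer()),
                       ("regressor", RandomForestRegressor())])

# Each tunable parameter is exposed as "<step name>__<parameter name>".
imputer_params = [name for name in sorted(pipe.get_params())
                  if name.startswith("impute__")]
print(imputer_params)

# set_params accepts the same names that go into a parameter grid.
pipe.set_params(impute__strategy="median", regressor__max_depth=3)
print(pipe.named_steps["impute"].strategy)
```

Printing the full list from pipe.get_params() is a quick way to check for a misspelled grid key before starting a long search.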

Sample code (documented inline)

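import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sample/synthetic data of shape 1000 x 2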
X = np.random.randn(1000,2)
y = 1.5*X[:,0]+3.2*X[:, 1]+2.4

# Randomly set up to 200 entries in each column to NaN (randint can repeat indices)
X[np.random.randint(0,1000, 200), 0] = np.nan
X[np.random.randint(0,1000, 200), 1] = np.nan

# Simple pipeline which has an imputer followed by regressor
pipe = Pipeline(steps=[('impute', SimpleImputer(missing_values=np.nan)),
                       ('regressor', RandomForestRegressor())])

# 3 imputation strategies x 2 values of the regressor's max_depth:
# a total of 6 parameter combinations will be searched
param_grid = {
        'impute__strategy': ["mean", "median", "most_frequent"],
        'regressor__max_depth': [2,3]
        }

# Run the grid search
search = GridSearchCV(pipe, param_grid)
search.fit(X, y)

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)

Sample output:

Best parameter (CV score=0.730):
{'impute__strategy': 'median', 'regressor__max_depth': 3}

So with GridSearchCV we find that the best impute strategy for our sample data is median, in combination with a max_depth of 3.
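If you want to compare all three strategies side by side rather than only see the winner, the fitted search keeps the per-candidate scores in cv_results_. The sketch below is an illustration on synthetic data, not the Titanic set; the small sample size, n_estimators=20 and the fixed random seeds are assumptions made only to keep it fast and repeatable. It uses the cv=5 and neg_mean_squared_error settings from the question.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data (assumption: stands in for the real features/target).
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = 1.5 * X[:, 0] + 3.2 * X[:, 1] + 2.4
X[rng.randint(0, 300, 60), 0] = np.nan  # knock out some values in column 0

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])
param_grid = {"impute__strategy": ["mean", "median", "most_frequent"]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# One row per candidate; scores are negated MSE, so closer to 0 is better.
results = pd.DataFrame(search.cv_results_)
summary = results[["param_impute__strategy", "mean_test_score",
                   "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
```

Because the imputer sits inside the pipeline, it is refit on the training part of each fold, so the fill value is never computed from the rows being scored.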

You can keep extending the pipeline with other components.
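As one possible extension, a ColumnTransformer lets the imputer act on a single column while other columns get different treatment. The sketch below assumes a made-up DataFrame with "Age", "Fare" and "Sex" columns and a synthetic target; these are stand-ins, not the real Titanic data. The transformer names "age" and "sex" are again arbitrary labels, and the grid key simply chains them with double underscores.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data (not the real Titanic set).
rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
    "Sex": rng.choice(["male", "female"], n),
})
y = 0.5 * df["Fare"] + rng.randn(n)  # synthetic target
df.loc[rng.choice(n, 40, replace=False), "Age"] = np.nan  # 20% missing ages

# Impute only "Age", one-hot encode "Sex", pass "Fare" through unchanged.
preprocess = ColumnTransformer(transformers=[
    ("age", SimpleImputer(), ["Age"]),
    ("sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
], remainder="passthrough")

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Nested name: <pipeline step>__<transformer name>__<parameter>.
param_grid = {"preprocess__age__strategy": ["mean", "median", "most_frequent"]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(df, y)
print(search.best_params_)
```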

Upvotes: 3
