Reputation: 259
I'm preprocessing the Titanic data set in order to run it through some regressions. The "Age" column in the train and test sets is only populated for around 80% of the rows in each set.
Rather than just eliminate the rows that don't have an "Age" I'd like to use the SimpleImputer (from sklearn.impute import SimpleImputer) to fill in the missing values in those columns.
SimpleImputer has three options for the 'strategy' parameter that work with numeric data: mean, median, and most_frequent (mode). (There's also the option to fill with a constant value, but because I'm trying to avoid "binning" the values I don't want to use this option.)
At its most basic, my approach would involve manually setting up the required datasets. I'd have to run one of each kind of imputer (imputer = SimpleImputer(strategy="xxxxxx") where xxxxxx = 'mean', 'median', or 'most_frequent') on each of the train and test datasets, ending up with six different datasets that I'd then have to feed through my RandomForestRegressor one at a time.
I know that GridSearchCV can be used to exhaustively compare various combinations of parameter values in a regressor, so I'm wondering if anyone knows a way to use it, or something similar, to run through the various 'strategy' options of the imputer?
I'm thinking something along the lines of the following pseudocode:
param_grid = [
    {'strategy': ['mean', 'median', 'most_frequent']},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(titanic_features[strategy], titanic_values[strategy])
Is there a clean way to compare options like this?
Is there a better way to compare the three options than to build all six data sets, run them through the RF regressor and see what comes out?
Upvotes: 1
Views: 1104
Reputation: 16856
Sklearn Pipelines are meant for exactly this. Create a pipeline with an imputer step preceding the regressor. You can then use a grid search parameter grid with the <step name>__<parameter name> syntax (double underscore) to pass parameters to a specific step.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Sample/synthetic data, shape 1000 x 2
X = np.random.randn(1000, 2)
y = 1.5 * X[:, 0] + 3.2 * X[:, 1] + 2.4
# Set up to 200 randomly chosen entries in each column to NaN
# (randint can repeat indices, so the count may be slightly lower)
X[np.random.randint(0, 1000, 200), 0] = np.nan
X[np.random.randint(0, 1000, 200), 1] = np.nan
# Simple pipeline: an imputer followed by a regressor
pipe = Pipeline(steps=[('impute', SimpleImputer(missing_values=np.nan)),
                       ('regressor', RandomForestRegressor())])
# 3 imputation strategies x 2 values of max_depth:
# a total of 6 parameter combinations will be searched
param_grid = {
    'impute__strategy': ["mean", "median", "most_frequent"],
    'regressor__max_depth': [2, 3]
}
# Run grid search (default scoring for a regressor is R^2)
search = GridSearchCV(pipe, param_grid)
search.fit(X, y)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
Sample output:
Best parameter (CV score=0.730):
{'impute__strategy': 'median', 'regressor__max_depth': 3}
So with GridSearchCV we find that the best imputation strategy for our sample data is median, in combination with a max_depth of 3.
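If you want to see how all three strategies compare rather than only the winner, the fitted search object keeps a score for every combination in cv_results_. Below is a minimal sketch on small synthetic data, assuming the same step names as above ('impute', 'regressor'). The data size, seeds, and n_estimators=20 are arbitrary choices for illustration; cv=5 and the neg_mean_squared_error scoring are taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic data set with missing values in the first column
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] + 3.2 * X[:, 1] + 2.4
X[rng.choice(300, size=60, replace=False), 0] = np.nan

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Every tunable name follows the <step>__<parameter> pattern
param_names = sorted(pipe.get_params().keys())

param_grid = {"impute__strategy": ["mean", "median", "most_frequent"]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)

# Map each strategy to its mean cross-validated score
# (negated MSE, so values closer to zero are better)
scores = {
    params["impute__strategy"]: mean
    for params, mean in zip(search.cv_results_["params"],
                            search.cv_results_["mean_test_score"])
}
for strategy, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{strategy:>14}: {score:.4f}")
```

Printing param_names is also a quick way to check the exact key to put in the parameter grid when a name is rejected.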
You can keep extending the pipeline with other components.
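As one possible illustration of this, a whole pipeline step can itself be treated as a searchable parameter, so different imputer classes can be compared in the same search. The sketch below is an assumption on my part rather than something the question asked for: it swaps in KNNImputer (also from sklearn.impute) as an alternative to SimpleImputer, and the data, seeds, and grid values are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Small synthetic data set with missing values in the first column
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 1.5 * X[:, 0] + 3.2 * X[:, 1] + 2.4
X[rng.choice(300, size=60, replace=False), 0] = np.nan

pipe = Pipeline(steps=[
    ("impute", SimpleImputer()),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# A list of grids: each dict replaces the 'impute' step with a given
# imputer and searches only the parameters that imputer understands
param_grid = [
    {"impute": [SimpleImputer()],
     "impute__strategy": ["mean", "median", "most_frequent"]},
    {"impute": [KNNImputer()],
     "impute__n_neighbors": [3, 5]},
]

search = GridSearchCV(pipe, param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 3 + 2 combinations
print(n_candidates, search.best_params_)
```

Using a list of dicts keeps the two imputers' parameters separate, so the search never tries to set n_neighbors on a SimpleImputer or strategy on a KNNImputer.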
Upvotes: 3