Saving scikit-learn models in a for loop

Question

I'm running a bunch of models with scikit-learn to solve a classification problem.

Here is the code that should do all the running:

for model_name, classifier, param_grid, cv, cv_name in tqdm(zip(model_names, classifiers, param_grids, cvs, cv_names)):
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])

    train_and_score_model(model_name, pipeline, param_grid, cv=cv)

My question is, how can I retain the output of my train_and_score_model function? It returns a cv object, i.e. a model.

What I tried, though I don't think it's right, was to create a list cv_names = ['dm_cv', 'lr_cv', 'knn_cv', 'svm_cv', 'dt_cv', 'rf_cv', 'nb_cv'] and assign to each name as the for loop runs. That is the cv_name variable in the for loop header.

I don't think that works, though, because I would be assigning to a string rather than to a variable. What I really want is cv_names = [dm_cv, lr_cv, knn_cv, svm_cv, dt_cv, rf_cv, nb_cv], but I don't think I can build a list like that.

Another way I thought of is saving each model in a dictionary, where the keys would be the elements of the list I outlined above. I don't know if I can have a model as a dictionary value though.

Here is the clunky, repetitive code I currently run to do what I want in the for-loop:

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_dm)])
dm_cv = train_and_score_model('Dummy Model', pipeline, param_grid_dm)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_lr)])
lr_cv = train_and_score_model('Logistic Regression', pipeline, param_grid_lr)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_knn)])
knn_cv = train_and_score_model('K Nearest Neighbors', pipeline, param_grid_knn)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_svm)])
svm_cv = train_and_score_model('Support Vector Machine', pipeline, param_grid_svm)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_dt)])
dt_cv = train_and_score_model('Decision Tree', pipeline, param_grid_dt)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_rf)])
rf_cv = train_and_score_model('Random Forest', pipeline, param_grid_rf)


pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                     ('classifier', classifier_nb)])
nb_cv = train_and_score_model('Naive Bayes', pipeline, param_grid_nb)

panktijk · Accepted Answer

You can create a dictionary that maps each classifier name to its information, i.e. the classifier object and its parameter grid. Dictionary values can be any Python object, so storing models this way is fine:

models_list = {'Logistic Regression': (classifier_lr, param_grid_lr),
               'K Nearest Neighbours': (classifier_knn, param_grid_knn)}
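
If the names, classifiers, and parameter grids already live in parallel lists, as in the question's loop, the same dictionary can probably be built with zip rather than typed out by hand. A minimal sketch, where the placeholder strings and the grid contents are assumptions standing in for the real estimator objects and grids:

```python
# Placeholder values (assumptions): in the real script these lists would hold
# the estimator objects and parameter grids, e.g. classifier_lr, param_grid_lr.
model_names = ['Logistic Regression', 'K Nearest Neighbors']
classifiers = ['<classifier_lr>', '<classifier_knn>']
param_grids = [{'classifier__C': [0.1, 1.0]},
               {'classifier__n_neighbors': [3, 5]}]

# The inner zip pairs the i-th classifier with the i-th grid;
# the outer zip attaches the i-th name as the dictionary key.
models_list = dict(zip(model_names, zip(classifiers, param_grids)))

# Each value is a (classifier, param_grid) tuple, looked up by model name.
classifier, param_grid = models_list['K Nearest Neighbors']
```

This keeps the existing lists as the single source of truth, at the cost of relying on all three lists being in the same order.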

Iterate through every key-value pair in the dictionary, build a pipeline for each, and store the returned model in a second dictionary keyed by model name:

model_cvs = {}
for model_name, (classifier, param_grid) in models_list.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])
    model_cvs[model_name] = train_and_score_model(model_name, pipeline, param_grid)
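
To see the pattern end to end, here is a rough sketch. The question does not show train_and_score_model, so the version below is a hypothetical stand-in that wraps GridSearchCV; the iris data, the StandardScaler preprocessor, and the parameter grids are likewise assumptions chosen only so the example runs.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and preprocessor standing in for the asker's own (assumptions).
X, y = load_iris(return_X_y=True)
preprocessor = StandardScaler()


def train_and_score_model(model_name, pipeline, param_grid, cv=5):
    """Hypothetical stand-in: grid-search the pipeline, return the fitted search."""
    search = GridSearchCV(pipeline, param_grid, cv=cv)
    search.fit(X, y)
    return search


# Illustrative grids; keys use the '<step name>__<parameter>' convention.
models_list = {
    'Dummy Model': (DummyClassifier(),
                    {'classifier__strategy': ['most_frequent', 'prior']}),
    'Logistic Regression': (LogisticRegression(max_iter=1000),
                            {'classifier__C': [0.1, 1.0, 10.0]}),
}

model_cvs = {}
for model_name, (classifier, param_grid) in models_list.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', classifier)])
    model_cvs[model_name] = train_and_score_model(model_name, pipeline, param_grid)

# Every fitted search object is now retained and can be looked up by name.
for model_name, fitted in model_cvs.items():
    print(model_name, round(fitted.best_score_, 3), fitted.best_params_)

# Pick the model with the highest cross-validated score.
best_name = max(model_cvs, key=lambda name: model_cvs[name].best_score_)
```

If each model needs its own cv setting, as in the question's original loop, it could be stored as a third element of each tuple and passed through as cv=cv.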
