Matt

Reputation: 85

Packaging MultiLabelBinarizer into scikit-learn Pipeline for inference on new data

I'm building a multilabel classifier to predict labels based on a text field. For example, predicting genres based on movie title. I'd like to use MultiLabelBinarizer() to binarize a column containing all applicable genre labels. For example, ['action','comedy','drama'] gets split into three columns with 0/1 values.

The reason I'm using MultiLabelBinarizer() is so that I can use the built-in inverse_transform() function to turn the output array (e.g. array([0, 0, 1, 0, 1])) directly into user-friendly text output (['action','drama']).
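
For illustration, a minimal sketch of that round trip on made-up data (the genre lists below are invented, not taken from my dataset):

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Toy genre lists, invented for illustration
genres = [["action", "comedy", "drama"], ["drama"], ["action", "drama"]]

mlb = MultiLabelBinarizer()
# One 0/1 column per genre; the columns follow mlb.classes_ (sorted label names)
y = mlb.fit_transform(genres)

# Map a binary prediction row back to label names
decoded = mlb.inverse_transform(np.array([[1, 0, 1]]))

print(mlb.classes_.tolist())
print(y.tolist())
print(decoded)
```

Note that inverse_transform() returns one tuple of labels per input row, so it needs the same fitted binarizer that produced the training targets.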

The classifier works, but I'm having issues predicting on new data. I can't find a way to integrate the MultiLabelBinarizer() into my Pipeline so that it can be saved and re-loaded for inference on new data. One solution is to save it as a pickle object separately and load it back each time, but I'd like to avoid having this dependency in production.

I know that this is similar to the TF-IDF vectorizer I've built into my Pipeline, but it differs in that it's applied to the target column (genre labels) rather than to my independent variable (the text field). Here's my code for training the multilabel SVM:

def svm_train(df):  
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(df['Genres'])

  with mlflow.start_run():
    x_train, x_test, y_train, y_test = train_test_split(df['Movie Title'], y, test_size=0.3)

    # Instantiate TF-IDF Vectorizer and SVM Model
    tfidf_vect = TfidfVectorizer()
    mdl = OneVsRestClassifier(LinearSVC(loss='hinge'))
    svm_pipeline = Pipeline([('tfidf', tfidf_vect), ('clf', mdl)])

    svm_pipeline.fit(x_train, y_train)
    prediction = svm_pipeline.predict(x_test)

    report = classification_report(y_test, prediction, target_names=mlb.classes_)

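    mlflow.sklearn.log_model(svm_pipeline, "Multilabel Classifier")
    # log_artifact() expects a local file path rather than a Python object,
    # so pickle the binarizer first ("mlb.pkl" is a placeholder filename)
    import pickle
    with open("mlb.pkl", "wb") as f:
      pickle.dump(mlb, f)
    mlflow.log_artifact("mlb.pkl", "MLB")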

  return report

svm_train(df)

Inference consists of re-loading the saved model from MLflow (same as loading back in a pickle file) in a separate Databricks notebook and predicting using the Pipeline:

def predict_labels(new_data):
  model_uri = '...MLflow path...'
  model = mlflow.sklearn.load_model(model_uri)
  predictions = model.predict(new_data)
  # If I can't package the MultiLabelBinarizer() into the Pipeline, this 
  # is where I'd have to load the pickle object mlb
  # so that I can inverse_transform()
  return mlb.inverse_transform(predictions)

new_data = ['Some movie title']
predict_labels(new_data)

['action','comedy']

Here are all of the libraries I'm using:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import glob, os
from pyspark.sql import DataFrame
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score

Upvotes: 4

Views: 1761

Answers (1)

smurching

Reputation: 486

For your use case, you may want to consider using MLflow's functionality for persisting custom models. As per the docs:

While MLflow’s built-in model persistence utilities are convenient for packaging models from various popular ML libraries in MLflow Model format, they do not cover every use case. For example, you may want to use a model from an ML library that is not explicitly supported by MLflow’s built-in flavors. Alternatively, you may want to package custom inference code and data to create an MLflow Model. Fortunately, MLflow provides two solutions that can be used to accomplish these tasks: Custom Python Models and Custom Flavors.

In particular, you should be able to log the MultiLabelBinarizer as an artifact along with the scikit-learn model, in a manner analogous to the XGBoost model in the linked example, and then load it back at prediction time, something like:

# Save sklearn model & multilabel indexer to paths on the local filesystem
sklearn_model_path = "some/local/path"
labelindexer_path = "another/local/path"
# ... save your model objects here to sklearn_model_path and labelindexer_path

# Define the custom model class
import mlflow.pyfunc
class SklearnWrapper(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import pickle
        import mlflow.sklearn
        # context.artifacts maps each artifact name to its resolved local path
        with open(context.artifacts["indexer_path"], 'rb') as handle:
            self.indexer = pickle.load(handle)
        self.pipeline = mlflow.sklearn.load_model(context.artifacts["pipeline_path"])

    def predict(self, context, model_input):
        # Run the pipeline, then map the binary predictions back to label names
        pipeline_preds = self.pipeline.predict(model_input)
        return self.indexer.inverse_transform(pipeline_preds)

# Create a Conda environment for the new MLflow Model that contains scikit-learn
# as a dependency, as well as the required CloudPickle library
import cloudpickle
import sklearn
conda_env = {
    'channels': ['defaults'],
    'dependencies': [
      # The Conda package is named scikit-learn, not sklearn
      'scikit-learn={}'.format(sklearn.__version__),
      'cloudpickle={}'.format(cloudpickle.__version__),
    ],
    'name': 'sklearn_env'
}

# Save the MLflow Model
artifacts = {
    "pipeline_path": sklearn_model_path,
    "indexer_path": labelindexer_path,
}
mlflow_pyfunc_model_path = "sklearn_mlflow_pyfunc"
mlflow.pyfunc.save_model(
        path=mlflow_pyfunc_model_path, python_model=SklearnWrapper(), artifacts=artifacts,
        conda_env=conda_env)

# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)
# Predict on a pandas DataFrame
import pandas as pd
loaded_model.predict(pd.DataFrame(...))

Note that our custom model still loads back the MultiLabelBinarizer, but MLflow will persist it along with your pipeline & custom model logic, so that you can treat the model as a single coherent unit for production deployment.
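
The "save your model objects" step left open in the snippet above could look roughly like the following for the binarizer. This is only a sketch under assumptions: the toy labels and the temporary "mlb.pkl" location are made up for illustration, and in practice you would pickle the binarizer fitted on your training labels. The pipeline itself would presumably be written to sklearn_model_path with mlflow.sklearn.save_model(), so that mlflow.sklearn.load_model() can read it back in load_context().

```python
import os
import pickle
import tempfile

from sklearn.preprocessing import MultiLabelBinarizer

# Toy labels, invented for illustration; use your fitted training binarizer instead
mlb = MultiLabelBinarizer().fit([["action", "comedy"], ["drama"]])

# Hypothetical location standing in for labelindexer_path from the answer
labelindexer_path = os.path.join(tempfile.mkdtemp(), "mlb.pkl")
with open(labelindexer_path, "wb") as handle:
    pickle.dump(mlb, handle)

# Mirrors what load_context() does at prediction time
with open(labelindexer_path, "rb") as handle:
    restored = pickle.load(handle)

# The restored binarizer decodes binary rows exactly like the original one
labels = restored.inverse_transform(restored.transform([["drama", "action"]]))
print(labels)
```

Because the pickle file is listed in the artifacts dict, MLflow copies it into the saved model directory, so nothing outside that directory is needed at inference time.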

Upvotes: 3
