Packaging MultiLabelBinarizer into scikit-learn Pipeline for inference on new data

Question

I'm building a multilabel classifier to predict labels from a text field, for example predicting genres from a movie title. I'd like to use MultiLabelBinarizer() to binarize a column containing all applicable genre labels, so that a value such as ['action','comedy','drama'] is split into three columns of 0/1 values.
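
As a small illustration of the binarization described above, here is a minimal sketch on made-up data (the `genres` list is hypothetical and not taken from the real dataset). It shows the 0/1 indicator matrix and the column order stored in `classes_`:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy data: one list of genre labels per movie
genres = [
    ["action", "comedy", "drama"],
    ["comedy"],
    ["action", "thriller"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

# classes_ holds the sorted label names, one per output column
print(mlb.classes_.tolist())  # ['action', 'comedy', 'drama', 'thriller']
print(y.tolist())             # [[1, 1, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1]]
```

Each row has one column per distinct label seen during fitting, so the first movie gets three 1s in a four-column row.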

The reason I'm using MultiLabelBinarizer() is so that I can use its built-in inverse_transform() method to turn the output array (e.g. array([0, 0, 1, 0, 1])) directly into user-friendly text output (['action','drama']).
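
The round trip back to text can be sketched as follows. The five labels are hypothetical and chosen only for illustration. Note that inverse_transform() expects a 2-D indicator array with one row per sample and returns one tuple of labels per row, so the real output is a list of tuples rather than a flat list:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label set; fitting sorts it alphabetically into classes_
mlb = MultiLabelBinarizer()
mlb.fit([["action", "comedy", "drama", "horror", "romance"]])

# One row per sample, one column per label in mlb.classes_
predicted = np.array([[1, 0, 1, 0, 0]])
labels = mlb.inverse_transform(predicted)

# Convert NumPy strings to plain str so the printout is easy to read
print([[str(label) for label in row] for row in labels])  # [['action', 'drama']]
```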

The classifier works, but I'm having issues predicting on new data. I can't find a way to integrate the MultiLabelBinarizer() into my Pipeline so that it can be saved and re-loaded for inference on new data. One solution is to save it as a pickle object separately and load it back each time, but I'd like to avoid having this dependency in production.
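
To make the goal concrete, one possible shape for such a single saved object is a thin wrapper that owns both the Pipeline and the binarizer. This is only a sketch under assumptions: `LabelDecodingClassifier` is a hypothetical class (it is not part of scikit-learn), the titles and genres are made up, and only a plain pickle round trip is demonstrated. Whether mlflow.sklearn.log_model() accepts such a wrapper unchanged would need to be verified, and the class definition has to be importable wherever the object is loaded.

```python
import pickle

from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC


class LabelDecodingClassifier(BaseEstimator):
    """Hypothetical wrapper: binarizes y on fit, decodes labels on predict."""

    def __init__(self, pipeline):
        self.pipeline = pipeline

    def fit(self, X, y):
        # The binarizer is fitted here and stored on the wrapper itself
        self.mlb_ = MultiLabelBinarizer()
        self.pipeline.fit(X, self.mlb_.fit_transform(y))
        return self

    def predict(self, X):
        # Indicator rows from the pipeline are mapped back to label tuples
        return self.mlb_.inverse_transform(self.pipeline.predict(X))


# Made-up training data for illustration only
titles = ["Fast Fury", "Laugh Riot", "Fast Laugh", "Fury Road Race"]
genres = [["action"], ["comedy"], ["action", "comedy"], ["action"]]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LinearSVC())),
])
model = LabelDecodingClassifier(pipeline).fit(titles, genres)

# A single serialized object carries the classifier and the label mapping
restored = pickle.loads(pickle.dumps(model))
print(restored.predict(["Fast Fury"]))
```

The design choice here is to keep the target transformation outside the Pipeline steps (which only transform X) and instead place it one level up, so that saving the wrapper saves everything needed for inference.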

I know that this is similar to the TF-IDF vectorizer I've built into my Pipeline, but it differs in that it's applied to the target column (genre labels) instead of my independent variable (the text field). Here's my code for training the multilabel SVM:

def svm_train(df):
  mlb = MultiLabelBinarizer()
  y = mlb.fit_transform(df['Genres'])

  with mlflow.start_run():
    x_train, x_test, y_train, y_test = train_test_split(df['Movie Title'], y, test_size=0.3)

    # Instantiate TF-IDF Vectorizer and SVM Model
    tfidf_vect = TfidfVectorizer()
    mdl = OneVsRestClassifier(LinearSVC(loss='hinge'))
    svm_pipeline = Pipeline([('tfidf', tfidf_vect), ('clf', mdl)])

    svm_pipeline.fit(x_train, y_train)
    prediction = svm_pipeline.predict(x_test)

    report = classification_report(y_test, prediction, target_names=mlb.classes_)

    mlflow.sklearn.log_model(svm_pipeline, "Multilabel Classifier")
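    # mlflow.log_artifact() expects a local file path rather than a Python object,
    # so the binarizer is written to disk first (this needs `import joblib`)
    joblib.dump(mlb, "mlb.pkl")
    mlflow.log_artifact("mlb.pkl", "MLB")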

  return report

svm_train(df)

Inference consists of re-loading the saved model from MLflow (much like loading a pickle file) in a separate Databricks notebook and predicting with the Pipeline:

def predict_labels(new_data):
  model_uri = '...MLflow path...'
  model = mlflow.sklearn.load_model(model_uri)
  predictions = model.predict(new_data)
  # If I can't package the MultiLabelBinarizer() into the Pipeline, this 
  # is where I'd have to load the pickle object mlb
  # so that I can inverse_transform()
  return mlb.inverse_transform(predictions)

new_data = ['Some movie title']
predict_labels(new_data)

['action','comedy']
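
For comparison, the fallback mentioned in the code comments above (storing the binarizer as its own file and loading it at inference time) might look like the following sketch. The temporary directory and the file name `mlb.pkl` are placeholders rather than real MLflow paths, and the indicator row stands in for the output of model.predict():

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Training side: fit the binarizer and write it next to the model artifacts
mlb = MultiLabelBinarizer().fit([["action", "comedy", "drama"]])
artifact_dir = tempfile.mkdtemp()  # placeholder for the real artifact location
mlb_path = os.path.join(artifact_dir, "mlb.pkl")
joblib.dump(mlb, mlb_path)

# Inference side: reload the binarizer and decode an indicator row
loaded_mlb = joblib.load(mlb_path)
predictions = np.array([[1, 1, 0]])  # stand-in for model.predict(new_data)
decoded = loaded_mlb.inverse_transform(predictions)

print([[str(label) for label in row] for row in decoded])  # [['action', 'comedy']]
```

This works, but it is exactly the two-artifact dependency the question is trying to avoid.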

Here are all of the libraries I'm using:

import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import glob, os
from pyspark.sql import DataFrame
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import svm
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, precision_score, recall_score
