taga

Reputation: 3885

Unable to make prediction after loading sklearn model

I created an ML model with scikit-learn and saved it. Now, when I load the model, I have trouble with transformation and prediction. My DataFrame has 4 features: the first two are text and the other two are numerical. The result column is 1 or 0.

To train my model, I used ColumnTransformer with CountVectorizer to vectorize the text features. I specified the NAMES of the columns that I want to vectorize (text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' leaves them unchanged.

Part of code that works:

import pickle

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# All columns except the last are features; the last column is the 0/1 label
features = df.iloc[:, :-1]
results = df.iloc[:, -1]

# Columns are selected by NAME here
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=12000, stop_words='english'), 'text1'),
        ('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features=2500, stop_words='english'), 'text2'),
    ],
    remainder='passthrough',  # Default is to drop untransformed columns; 'passthrough' leaves them as they are
)

x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)

x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)

# clf is the classifier instance (its definition is not shown here)
model = clf.fit(x_train, y_train)
y_pred = model.predict(x_test)

# Save the fitted model and the fitted transformer; 'with' closes the files
with open('ml_model.sav', 'wb') as f:
    pickle.dump(model, f)

with open('ml_transformer.sav', 'wb') as f:
    pickle.dump(transformerVectoriser, f)

But when I load the model and try to make a prediction, I get an error:

# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))

# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))

I get the error:

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

When I trained my model I used a pandas DataFrame, but for prediction I just put the values in a list. So data_for_prediction is a list that looks like this:

["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]

I think that is the cause of the error: I used column names with ColumnTransformer, so at prediction time the transformer does not know which values to vectorize. My final model and vectorizer will be used in an API that only accepts JSON, and I do not want to convert the JSON to a DataFrame before passing it to the model. Is there a way to fix this error without using a pandas DataFrame in my final Flask app?

Upvotes: 3

Views: 1321

Answers (2)

constt

Reputation: 2320

If you don't want to use pandas.DataFrame in your REST API endpoint, don't train your model on the DataFrame; convert your data to a numpy array first:

>>> df
                    TEXT_1                TEXT_2    NUM_1  NUM_2
0  This is the first text.      The second text.  300.000   23.3
1  Here is the third text.  And the fourth text.    2.334   29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
       ['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
      dtype=object)

Then change how you define the model. I'd suggest combining the preprocessing and prediction steps into a single model using sklearn.pipeline.Pipeline, like this:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

model = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[
            ('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
            ('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
        ],
        remainder='passthrough',
    )),
    ('predictor', RandomForestClassifier()),
])

Note that here we use indices instead of names to reference the text columns when defining transformers for the ColumnTransformer instance. Once the initial DataFrame has been converted to a numpy array, the TEXT_1 feature is at index 0 and TEXT_2 at index 1 in each data row. Here is how you can use the model:

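from joblib import dump, load

# X: 2D object array of features; y: the 0/1 labels (not part of df above)
X = df.to_numpy()
model.fit(X, y)
dump(model, 'model.joblib')

...

model = load('model.joblib')
# data must be 2D, one row per sample: [[text_1, text_2, num_1, num_2], ...]
results = model.predict(data)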

As a result, you don't have to convert your incoming data to a DataFrame in order to make a prediction.
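For illustration, here is a minimal end-to-end sketch of the index-based approach, from training to a prediction on a JSON payload. The training rows, the JSON keys (text1, text2, num1, num2) and FEATURE_ORDER are made up for this example, and pickle is used in memory only to avoid writing files; joblib's dump/load would work the same way on disk.

```python
import json
import pickle

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical): two text columns, then two numeric columns.
X = np.array([
    ["cheap pills offer", "buy now discount", 4, 7],
    ["meeting agenda notes", "quarterly budget review", 1, 2],
    ["win cash prize", "claim reward today", 5, 9],
    ["project status update", "team lunch friday", 0, 3],
], dtype=object)
y = np.array([1, 0, 1, 0])

# Text columns are referenced by position (0 and 1), not by name.
model = Pipeline(steps=[
    ("transformer", ColumnTransformer(
        transformers=[
            ("text_1", CountVectorizer(), 0),
            ("text_2", CountVectorizer(), 1),
        ],
        remainder="passthrough",
    )),
    ("predictor", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(X, y)

# Round-trip through pickle to mimic saving and loading the fitted pipeline.
restored = pickle.loads(pickle.dumps(model))

# Hypothetical JSON body as an API endpoint might receive it.
payload = json.loads(
    '{"text1": "cheap discount offer", "text2": "claim prize now", "num1": 4, "num2": 7}'
)

# Put the values in the same column order that was used for training.
FEATURE_ORDER = ["text1", "text2", "num1", "num2"]
row = [payload[name] for name in FEATURE_ORDER]

# One sample still needs a 2D shape; dtype=object keeps strings and numbers as they are.
data = np.array([row], dtype=object)
proba = restored.predict_proba(data)
print(proba)
```

The only thing the endpoint has to guarantee is the column order, since positions now carry the meaning that column names carried in the DataFrame version.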

Upvotes: 1

konstanze

Reputation: 511

The training data is a dataframe with the columns:

x_train.columns

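The function vectorizer.transform() expects data in the same format, so assuming that

data_f_p = ["text that should be vectorized", "more text that should be vectorized", 4, 7]

corresponds to the same four columns as x_train, you can turn it into a DataFrame with

data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)

Note that in the question's code x_train is overwritten by the output of fit_transform, which has no .columns attribute, so take the column names from features.columns or save them before transforming.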
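If a DataFrame at prediction time is acceptable, a JSON object can be turned into a one-row frame directly. The sketch below assumes the numeric columns are called num1 and num2 (the question only names text1 and text2) and uses made-up training rows; it shows only the transform step, not the classifier.

```python
import json

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Column names in training order; num1 and num2 are hypothetical names.
COLUMNS = ["text1", "text2", "num1", "num2"]

# Toy training frame (made up for illustration).
train = pd.DataFrame(
    [
        ["cheap pills offer", "buy now discount", 4, 7],
        ["meeting agenda notes", "quarterly budget review", 1, 2],
    ],
    columns=COLUMNS,
)

# Name-based column selection, as in the question.
vectorizer = ColumnTransformer(
    transformers=[
        ("vector_1", CountVectorizer(), "text1"),
        ("vector_2", CountVectorizer(), "text2"),
    ],
    remainder="passthrough",
)
vectorizer.fit(train)

# Hypothetical JSON body; key order does not matter because columns=COLUMNS
# selects and orders the keys when the frame is built.
payload = json.loads(
    '{"num2": 7, "text1": "cheap offer", "text2": "buy discount", "num1": 4}'
)
row = pd.DataFrame([payload], columns=COLUMNS)

transformed = vectorizer.transform(row)
print(transformed.shape)
```

Storing the column list next to the pickled transformer keeps the API independent of the training script's variables.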

Upvotes: 2
