Reputation: 3885
I have created an ML model with scikit-learn and saved it. Now, when I load the model, I have trouble with transformation and prediction. My DataFrame has 4 features: the first two are textual and the other two are numerical. The result column is 1 or 0.
To train my model, I used ColumnTransformer and CountVectorizer to transform and vectorize the textual features. I specified the names of the columns that I want to vectorize (columns text1 and text2). The numerical columns do not need to be vectorized, so remainder='passthrough' takes care of them.
Part of code that works:
import pickle
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

features = df.iloc[:, :-1]
results = df.iloc[:, -1]
transformerVectoriser = ColumnTransformer(
    transformers=[
        ('vector word 1', CountVectorizer(analyzer='word', ngram_range=(1, 1), max_features=12000, stop_words='english'), 'text1'),
        ('vector phrase 3', CountVectorizer(analyzer='word', ngram_range=(3, 3), max_features=2500, stop_words='english'), 'text2'),
    ],
    remainder='passthrough',  # default is to drop untransformed columns; 'passthrough' leaves them as they are
)
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.3, random_state=0)
x_train = transformerVectoriser.fit_transform(x_train)
x_test = transformerVectoriser.transform(x_test)
model = clf.fit(x_train, y_train)  # clf is the classifier, defined elsewhere (not shown)
y_pred = model.predict(x_test)
# Save the fitted model and the fitted transformer
with open('ml_model.sav', 'wb') as f:
    pickle.dump(model, f)
with open('ml_transformer.sav', 'wb') as f:
    pickle.dump(transformerVectoriser, f)
But when I load the model and try to make a prediction, it fails:
# LOADING MODEL
model = pickle.load(open('ml_model.sav','rb'))
vectorizer = pickle.load(open('ml_transformer.sav','rb'))
# MAKING PREDICTION
data_for_prediction = vectorizer.transform([data_for_prediction]) #ERROR
print(model.predict_proba(data_for_prediction))
I get the error:
ValueError: Specifying the columns using strings is only supported for pandas DataFrames
When I trained my model I used a pandas DataFrame, but for prediction I just put the values in a list. So data_for_prediction is a list that looks like this:
["text that should be vectorized with vectorizer that i created", "More texts that should be vectorized", 4, 7]
I think that is the cause of the error: I used column names when I set up the ColumnTransformer, but now, when I want to make a prediction, the vectorizer does not know which values to vectorize.
My final model and vectorizer will be used in an API, and the API should only take JSON, so I do not want to convert the JSON to a DataFrame before passing it to the model.
Is there a way to fix this error without using a pandas DataFrame in my final Flask app?
Upvotes: 3
Views: 1321
Reputation: 2320
If you don't want to use pandas.DataFrame in your REST API endpoint, don't train your model on the DataFrame; convert your data to a numpy array first:
>>> df
                    TEXT_1                TEXT_2    NUM_1  NUM_2
0  This is the first text.      The second text.  300.000   23.3
1  Here is the third text.  And the fourth text.    2.334   29.0
>>> df.to_numpy()
array([['This is the first text.', 'The second text.', 300.0, 23.3],
       ['Here is the third text.', 'And the fourth text.', 2.334, 29.0]],
      dtype=object)
Then change how you define the model. I'd suggest combining the preprocessing and prediction steps into a single model using sklearn.pipeline.Pipeline, like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

model = Pipeline(steps=[
    ('transformer', ColumnTransformer(
        transformers=[
            ('TEXT_1', CountVectorizer(analyzer='word', stop_words='english'), 0),
            ('TEXT_2', CountVectorizer(analyzer='word', stop_words='english'), 1),
        ],
        remainder='passthrough',
    )),
    ('predictor', RandomForestClassifier()),
])
Note that here we use indices instead of names to reference the text columns when defining the transformers for the ColumnTransformer instance. Once the initial DataFrame has been converted to a numpy array, the TEXT_1 feature is located at index 0 and TEXT_2 at index 1 in each data row. Here is how you can use the model:
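from joblib import dump, load

X = df.to_numpy()  # feature columns only, in the order TEXT_1, TEXT_2, NUM_1, NUM_2
model.fit(X, y)    # y holds the target labels
dump(model, 'model.joblib')
...
model = load('model.joblib')
# data must be 2-D: a list of rows, e.g. [[text_1, text_2, num_1, num_2]]
results = model.predict(data)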
As a result, you don't have to convert your incoming data to a DataFrame in order to make a prediction.
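For illustration, here is a minimal end-to-end sketch of this approach, from fitting on an object array to predicting from a JSON payload. The training rows, the labels, the JSON field names (text_1, text_2, num_1, num_2) and the RandomForestClassifier settings are all made up for the example; the only requirement taken from the answer is that each row is in the same column order as the training array.

```python
import json

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Toy training data (invented): two text columns followed by two numeric ones.
X = np.array(
    [
        ["cheap pills online", "buy now today", 3, 1],
        ["meeting agenda attached", "see you tomorrow", 1, 0],
        ["win money fast", "click this link", 5, 2],
        ["project status update", "notes from monday", 2, 0],
    ],
    dtype=object,
)
y = [1, 0, 1, 0]

model = Pipeline(steps=[
    ("transformer", ColumnTransformer(
        transformers=[
            # Columns are referenced by position, so no DataFrame is needed.
            ("TEXT_1", CountVectorizer(), 0),
            ("TEXT_2", CountVectorizer(), 1),
        ],
        remainder="passthrough",
    )),
    ("predictor", RandomForestClassifier(n_estimators=10, random_state=0)),
])
model.fit(X, y)

# Hypothetical JSON body as it might arrive at the API endpoint.
payload = '{"text_1": "win cheap money", "text_2": "click now", "num_1": 4, "num_2": 2}'
record = json.loads(payload)

# Build one row in the same column order used for training.
row = [record["text_1"], record["text_2"], record["num_1"], record["num_2"]]

# The pipeline expects 2-D input, so wrap the single row in a list.
proba = model.predict_proba([row])
print(proba)
```

Because the vectorizers live inside the pipeline, only one object has to be saved and loaded, and the endpoint never has to call transform itself.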
Upvotes: 1
Reputation: 511
The training data is a DataFrame with the columns:
x_train.columns
(here x_train means the original DataFrame, before fit_transform replaces it with the transformed matrix). The vectorizer.transform() function expects data in the same format. Assuming that
data_f_p = ["text that should be vectorized", 4, 7, 0]
corresponds to the same four columns as x_train, you can turn it into a one-row DataFrame with:
data_f_p = pd.DataFrame([data_f_p], columns=x_train.columns)
data_f_p = vectorizer.transform(data_f_p)
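A self-contained sketch of the same idea, assuming the column names are stored in a constant so the prediction code does not need x_train at all. The names num1 and num2, the training rows and the simplified CountVectorizer settings are hypothetical; substitute the columns of your own training DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical column names; they must match the training DataFrame exactly.
COLUMNS = ["text1", "text2", "num1", "num2"]

# Toy training frame (invented) standing in for the original x_train.
train = pd.DataFrame(
    [
        ["red apple pie", "baked fresh every morning", 4, 7],
        ["green pear tart", "served cold with cream", 1, 3],
    ],
    columns=COLUMNS,
)

vectorizer = ColumnTransformer(
    transformers=[
        ("vector word 1", CountVectorizer(), "text1"),
        ("vector word 2", CountVectorizer(), "text2"),
    ],
    remainder="passthrough",
)
vectorizer.fit(train)

# One incoming record as a plain list, in the same column order.
data_f_p = ["red pear pie", "served fresh", 4, 7]

# Wrap it in a one-row DataFrame so the string column names resolve.
row = pd.DataFrame([data_f_p], columns=COLUMNS)
transformed = vectorizer.transform(row)
print(transformed.shape)
```

The one-row DataFrame is only a thin wrapper around the list, so the API can still accept plain JSON and build the frame just before calling transform.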
Upvotes: 2