Akalanka Test

Reputation: 21

How to save a one-hot encoded model and predict new unencoded data using scikit-learn?

My dataset contains 3 categorical features, and I used one-hot encoding to convert them to binary format, which worked fine. But when I save the trained model and try to predict on new raw data, the input is not encoded as I expected and the call results in an error.

combined_df_raw2 = pd.concat([train_x_raw, unknown_test_df])
combined_df2 = pd.get_dummies(combined_df_raw2, columns=nominal_cols,
                              drop_first=True)

encoded_unknown_df = combined_df2[len(train_x_raw):]

classifier = DecisionTreeClassifier(random_state=17)
classifier.fit(train_x_raw, train_Y)

pred_y = classifier.predict(encoded_unknown_df)

#here I use joblib to save my model and load it again
joblib.dump(classifier, 'savedmodel')
imported_model = joblib.load('savedmodel')

#here I pass unencoded raw data to predict and get an error that 'tcp'
#cannot be converted to float, which means the input is not encoded

imported_model.predict([0,'tcp','vmnet','REJ',0,0,0,23])   

ValueError: could not convert string to float: 'tcp'

Upvotes: 2

Views: 7785

Answers (3)

YoungSheldon

Reputation: 1195

The model was trained on encoded categorical variables, so new input must also be one-hot encoded in the same way before it is passed to the model. For example, suppose one of the columns is titled "Country" and has three different values across the dataset, ['India', 'Israel', 'France']. You apply OneHotEncoder (probably after LabelEncoder) to the country column, then train your model, save it, and do whatever else you want.

The problem is that you get an input error when you pass input directly without converting it to the format the model was trained on. Hence, we always want to preprocess the input before giving it to the model. The most common way, to my knowledge, is to use a Pipeline.

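from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Encode inside the pipeline so raw input is transformed the same way at predict time
steps = [('ohe', OneHotEncoder(handle_unknown='ignore')),
         ('tree', DecisionTreeClassifier())]
pipeline = Pipeline(steps) # You need to save this pipeline via joblib
pipeline.fit(X_train, y_train)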

In case you don't want to use a Pipeline, you can still apply OneHotEncoder to the specific column(s) yourself and then call predict.
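
As a minimal sketch of the Pipeline approach for mixed data, the example below wraps a OneHotEncoder in a ColumnTransformer so that only the categorical columns are encoded, then saves and reloads the whole pipeline with joblib. The column names, sample rows, labels, and file name are made up for illustration and are not taken from the question's dataset.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: two categorical columns and one numeric column.
X_train = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "flag": ["REJ", "SF", "SF", "REJ"],
    "duration": [0, 12, 3, 0],
})
y_train = [1, 0, 0, 1]

nominal_cols = ["protocol", "flag"]

# One-hot encode only the categorical columns; pass numeric columns through.
preprocess = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(handle_unknown="ignore"), nominal_cols)],
    remainder="passthrough",
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=17)),
])
pipeline.fit(X_train, y_train)

# Persist the whole pipeline so the fitted encoder travels with the model.
model_path = os.path.join(tempfile.mkdtemp(), "savedmodel.joblib")
joblib.dump(pipeline, model_path)
imported_model = joblib.load(model_path)

# New raw, unencoded rows only need the same column names as the training frame.
new_raw = pd.DataFrame({"protocol": ["tcp"], "flag": ["REJ"], "duration": [0]})
prediction = imported_model.predict(new_raw)
```

Because the encoder is a step inside the saved object, the reloaded model accepts string values such as 'tcp' directly, and handle_unknown="ignore" encodes categories not seen during training as all zeros instead of raising.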

Upvotes: 4

codeblooded

Reputation: 350

Use fit() followed by transform(); that way you can pickle your one-hot encoder after you have fit it.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)

Then let's pickle your encoder. You could use other ways of persisting the encoder; see https://scikit-learn.org/stable/modules/model_persistence.html

import pickle
with open('encoder.pickle', 'wb') as f:
    pickle.dump(enc, f)

Now when you have new data to predict, you must first run it through the same pre-processing you applied to your training data, in this case the encoder. Let's load it back:

with open('encoder.pickle', 'rb') as f:
    enc = pickle.load(f)

Once you have it loaded, you just need to transform the new data.

enc.transform(new_data)

To know more about pickle, see https://docs.python.org/3/library/pickle.html
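
Putting the steps together, here is a rough end-to-end sketch: fit the encoder, train a classifier on the encoded data, serialize both, then restore them and predict on new raw rows. The labels y, the classifier choice, and the new_data rows are assumptions added for illustration; pickle.dumps/pickle.loads are used on in-memory bytes here only to keep the example self-contained.

```python
import pickle

from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy features as above; the labels are hypothetical.
X = [['Male', 1], ['Female', 3], ['Female', 2]]
y = [0, 1, 1]

enc = OneHotEncoder(handle_unknown='ignore')
X_encoded = enc.fit_transform(X)

clf = DecisionTreeClassifier(random_state=0).fit(X_encoded, y)

# Serialize the encoder and the model together so they cannot get out of sync.
# pickle.dumps/loads operate on bytes; pickle.dump/load operate on file objects.
blob = pickle.dumps({'encoder': enc, 'model': clf})
restored = pickle.loads(blob)

# New raw rows; the value 4 was never seen during fit and is encoded as all zeros.
new_data = [['Female', 1], ['Male', 4]]
new_encoded = restored['encoder'].transform(new_data)
predictions = restored['model'].predict(new_encoded)
```

The encoded new data always has the same number of columns as the training data, which is what the classifier needs.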

Upvotes: 3

agetareen

Reputation: 1

@chintan For example, if the incoming raw data contains only one instance and you encode its categorical variable on its own, you get only one extra column, whereas the categorical column in the training data may have produced something like 500 columns, so the shapes won't match. Take currencies as an example: an incoming instance contains INR only, so encoding it yields a single column, but the training data has columns for all the currencies in the world.
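
A small sketch of this mismatch, and two possible ways around it, might look like the following. The currency values are invented for illustration; the first option realigns the get_dummies output to the training columns, and the second reuses an encoder fitted on the training data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with several currencies.
train = pd.DataFrame({"currency": ["INR", "USD", "EUR", "USD"]})
# A single incoming row that contains only one currency.
incoming = pd.DataFrame({"currency": ["INR"]})

# Encoding each frame independently produces different widths (3 vs 1 columns).
train_dummies = pd.get_dummies(train, columns=["currency"], dtype=int)
incoming_dummies = pd.get_dummies(incoming, columns=["currency"], dtype=int)

# Option 1: realign the incoming frame to the training columns, filling gaps with 0.
aligned = incoming_dummies.reindex(columns=train_dummies.columns, fill_value=0)

# Option 2: fit the encoder once on the training data and reuse it for new rows.
enc = OneHotEncoder(handle_unknown="ignore").fit(train[["currency"]])
encoded = enc.transform(incoming[["currency"]]).toarray()
```

Either way, the single INR row ends up with one column per currency seen in training, so it matches what the model was fitted on.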

Upvotes: 0
