Regressor

Reputation: 1973

How to apply one hot encoding on unseen future data in sklearn

I am working with the Titanic data as a sample set, and I have a use case where I want to one-hot encode features during the training phase and then fit my model. Afterwards, I plan to store the model so that I can load it back and score an unseen dataset. The plan is to have two .py files. The first, train.py, will load the data, do feature engineering, fit a logistic regression model and save the model to disk. The second is score.py. In score.py, I want to take an entire unseen dataset, load the model from disk and score that data to generate predictions. The problem is that in score.py I will have to transform the raw unseen data into one-hot encoded columns before generating predictions.

Here is some code for train.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


data_set = data[['Pclass','Sex','Age','Fare','SibSp','Cabin']]
one_hot_encoded_training_predictors = pd.get_dummies(data_set)
one_hot_encoded_training_predictors.head()
X = one_hot_encoded_training_predictors
y = data['Survived']

#Train Test split---75 25 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
logreg = LogisticRegression() 
logreg.fit(X_train, y_train)


##predicting test accuracy
y_pred = logreg.predict(X_test) #predicting the values

# Save model code here

logreg.save(..)

My score.py would look like

import statements
unseen_data = pd.read_csv(..) # this is raw unseen data

model.load(..)
model.predict(unseen_data)

Now imagine I have a dataset that the model has never seen. I can load the trained model using logreg.load(..), but the problem I am facing is: how do I first perform the one-hot encoding on my raw unseen features? Can I also save the one-hot encoding objects so they can be reused on the unseen set? I am new to machine learning and I might be missing something very simple, but that is the issue I need to resolve.

Upvotes: 1

Views: 2189

Answers (1)

Amine Benatmane

Reputation: 1261

If you use OneHotEncoder, you can handle unknown categories by setting the handle_unknown parameter to 'ignore'. When this parameter is set to 'ignore' and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
...
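
To tie this back to the train.py / score.py split in the question, here is a minimal sketch of one way it could look. It assumes the encoder and the model are wrapped in a single sklearn Pipeline and persisted with joblib, so the fitted encoder travels with the model. The toy DataFrame, the column choices and the file name titanic_model.joblib are made up for illustration and are not taken from the question.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the Titanic training data (values are invented).
train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2],
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Fare": [71.3, 7.9, 13.0, 8.1, 53.1, 26.0],
    "Survived": [1, 0, 1, 0, 0, 1],
})
categorical = ["Pclass", "Sex"]  # columns to one-hot encode
numeric = ["Fare"]               # columns passed through unchanged

# --- what train.py might do ---
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # keep the numeric columns after the one-hot columns
    sparse_threshold=0,       # always return a dense array, easier to inspect
)
model = Pipeline([
    ("preprocess", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])
model.fit(train[categorical + numeric], train["Survived"])

# Hypothetical location; a real script would use its own path.
model_path = os.path.join(tempfile.mkdtemp(), "titanic_model.joblib")
joblib.dump(model, model_path)  # saves the fitted encoder and the classifier together

# --- what score.py might do ---
loaded = joblib.load(model_path)
unseen = pd.DataFrame({
    "Pclass": [2, 4],             # 4 never appeared in training
    "Sex": ["male", "unknown"],   # "unknown" never appeared in training
    "Fare": [10.5, 30.0],
})
predictions = loaded.predict(unseen)  # raw data goes in; encoding happens inside

# Inspect the encoding: unseen categories become all-zero one-hot columns.
encoded = loaded.named_steps["preprocess"].transform(unseen)
print(encoded)
print(predictions)
```

Because the encoder is fitted only on the training data and saved inside the pipeline, score.py never has to rebuild the dummy columns itself. The second unseen row has all zeros in its one-hot columns, which is the handle_unknown='ignore' behaviour described above.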

Upvotes: 1
