Reputation: 2206
Suppose I have a data frame data with string columns that I want converted to indicators. I use pandas.get_dummies(data) to convert it to a dataset that I can use for building a model.
Now I have a single new observation that I want to run through my model. Obviously I can't use pandas.get_dummies(new_data), because the new observation doesn't contain all of the classes and so won't produce the same indicator columns. Is there a good way to do this?
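To make the problem concrete, here is a minimal sketch of the mismatch. The column name color and its values are made up for illustration and are not from my real data:

```python
import pandas as pd

# Hypothetical training data with a single string column.
data = pd.DataFrame({"color": ["red", "green", "blue"]})
# Hypothetical single new observation.
new_data = pd.DataFrame({"color": ["green"]})

train_dummies = pd.get_dummies(data)
new_dummies = pd.get_dummies(new_data)

# The training frame gets one indicator per class it saw...
print(list(train_dummies.columns))
# ...while the new observation only yields a column for its own class.
print(list(new_dummies.columns))
```

The two frames end up with different columns, so the new row cannot be passed to a model fitted on the training frame.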
Upvotes: 30
Views: 10796
Reputation: 11
It seems you can take advantage of the category dtype.
import pandas as pd
train = pd.DataFrame({'feature':['a', 'b', 'c', 'd']})
test = pd.DataFrame({'feature':['a']})
train['feature'] = train['feature'].astype('category')
dummies_type = train['feature'].dtype
test['feature'] = test['feature'].astype(dummies_type)
training data:
pd.get_dummies(train)
feature_a feature_b feature_c feature_d
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
testing data:
pd.get_dummies(test)
feature_a feature_b feature_c feature_d
1 0 0 0
a value of the feature that was not seen in training:
test_oov = pd.DataFrame({'feature':['z']})
test_oov['feature'] = test_oov['feature'].astype(dummies_type)
pd.get_dummies(test_oov)
feature_a feature_b feature_c feature_d
0 0 0 0
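The same idea can be sketched with an explicit pd.CategoricalDtype, which may be easier to store and reuse than a dtype pulled off the training column. The helper name encode is hypothetical, and dtype=int is passed only so the output is 0/1 regardless of pandas version:

```python
import pandas as pd

train = pd.DataFrame({"feature": ["a", "b", "c", "d"]})

# Record the categories seen during training in a reusable dtype.
feature_dtype = pd.CategoricalDtype(categories=sorted(train["feature"].unique()))


def encode(frame):
    """Cast 'feature' to the training categories, then one-hot encode."""
    out = frame.copy()
    # Values outside the training categories become NaN in this cast.
    out["feature"] = out["feature"].astype(feature_dtype)
    # A NaN row produces all-zero indicators (dummy_na defaults to False).
    return pd.get_dummies(out, dtype=int)


known = encode(pd.DataFrame({"feature": ["a"]}))
unseen = encode(pd.DataFrame({"feature": ["z"]}))
print(known)
print(unseen)
```

Both results should carry the same four indicator columns, with the unseen value encoded as a row of zeros.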
Upvotes: 1
Reputation: 6752
Building on JAB's answer, if you want to use it in sklearn pipelines, for example, this code may help:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class GetDummies(BaseEstimator, TransformerMixin):
    def __init__(self, dummy_columns):
        self.columns = None
        self.dummy_columns = dummy_columns

    def fit(self, X, y=None):
        # Remember the indicator columns produced by the training data.
        self.columns = pd.get_dummies(X, columns=self.dummy_columns).columns
        return self

    def transform(self, X):
        X_new = pd.get_dummies(X, columns=self.dummy_columns)
        # Align to the training columns; missing indicators are filled with 0.
        return X_new.reindex(columns=self.columns, fill_value=0)
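A rough usage sketch follows. It repeats the class so the snippet runs on its own, and the frames train and new_obs are made-up data, not part of the original answer:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GetDummies(BaseEstimator, TransformerMixin):
    def __init__(self, dummy_columns):
        self.columns = None
        self.dummy_columns = dummy_columns

    def fit(self, X, y=None):
        # Remember the indicator columns produced by the training data.
        self.columns = pd.get_dummies(X, columns=self.dummy_columns).columns
        return self

    def transform(self, X):
        X_new = pd.get_dummies(X, columns=self.dummy_columns)
        # Align to the training columns; missing indicators are filled with 0.
        return X_new.reindex(columns=self.columns, fill_value=0)


# Hypothetical training frame and single new observation.
train = pd.DataFrame({"cat": ["a", "b", "c", "d"], "val": [1, 2, 5, 10]})
new_obs = pd.DataFrame({"cat": ["a"], "val": [1]})

encoder = GetDummies(dummy_columns=["cat"]).fit(train)
encoded = encoder.transform(new_obs)
print(encoded)
```

The transformed row should have the same columns as the encoded training frame, so it can be handed to whatever estimator was fitted on that frame.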
Upvotes: 1
Reputation: 12801
You can create the dummies from the single new observation, and then reindex that frame's columns using the columns from the original indicator matrix:
import pandas as pd
df = pd.DataFrame({'cat':['a','b','c','d'],'val':[1,2,5,10]})
df1 = pd.get_dummies(pd.DataFrame({'cat':['a'],'val':[1]}))
dummies_frame = pd.get_dummies(df)
df1.reindex(columns = dummies_frame.columns, fill_value=0)
returns:
   val  cat_a  cat_b  cat_c  cat_d
0    1      1      0      0      0
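If you do this in more than one place, the reindex step can be wrapped in a small helper. This is only a sketch: align_dummies is a hypothetical name, and the observation with cat 'z' is invented to show what happens to a category that was never seen in training:

```python
import pandas as pd


def align_dummies(new_frame, training_columns):
    """One-hot encode new_frame and align it to the training columns."""
    encoded = pd.get_dummies(new_frame)
    # Indicators missing from the new data are added and filled with 0;
    # indicators for categories absent from training_columns are dropped.
    return encoded.reindex(columns=training_columns, fill_value=0)


df = pd.DataFrame({"cat": ["a", "b", "c", "d"], "val": [1, 2, 5, 10]})
training_columns = pd.get_dummies(df).columns

# Hypothetical new observation with a category not present in df.
unseen = pd.DataFrame({"cat": ["z"], "val": [3]})
aligned = align_dummies(unseen, training_columns)
print(aligned)
```

Note that the unseen category is silently encoded as all zeros, so if that matters for your model you may want to check for it before reindexing.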
Upvotes: 44