Reputation: 9345
I have a training set that I'm using to build some machine learning models and I need to set up some code to predict on a test set (that I don't have access to).
For instance, if I have a DataFrame, train:
car
0 Audi
1 BMW
2 Mazda
I can use pd.get_dummies to get:
car_Audi car_BMW car_Mazda
0 1 0 0
1 0 1 0
2 0 0 1
Call this resulting DataFrame train_encoded.
Now, suppose my test DataFrame looks like:
car
0 Mercedes
I can use:
pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)
to get:
car_Audi car_BMW car_Mazda
0 0 0 0
How can I treat NaNs the same as an unseen value for my car column? That is, if I encounter NaN in my car column in test, I want to get back:
car_Audi car_BMW car_Mazda
0 0 0 0
Thanks!
Upvotes: 1
Views: 1266
Reputation: 76297
If you generate a string filler that does not appear in df.car, then, slightly modifying Wen's suggestion in the comment (to cover the case where 'NAN' is itself a string in df.car), you can fill the missing values in test with it and encode as before:
test['car'] = test['car'].fillna(filler)
pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)
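To make the steps above concrete, here is a minimal end-to-end sketch. The placeholder "__missing__" and the sample test rows are assumptions chosen for illustration, and fill_value=0 is passed so that columns absent from the test encoding come back as 0 rather than NaN.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
train_encoded = pd.get_dummies(train)

# Illustrative test data: one unseen category, one missing value, one known value.
test = pd.DataFrame({"car": ["Mercedes", np.nan, "BMW"]})

# Hypothetical placeholder; assumed not to occur in the car column.
filler = "__missing__"

# Replace NaN with the placeholder so it is encoded like any other unseen value.
test_filled = test.assign(car=test["car"].fillna(filler))

# Align to the training columns; dummy columns for unseen values are dropped,
# and training columns missing from the test encoding are filled with 0.
test_encoded = (
    pd.get_dummies(test_filled)
    .reindex(columns=train_encoded.columns, fill_value=0)
    .astype(int)
)
print(test_encoded)
```

Both the Mercedes row and the NaN row should come out as all zeros, while the BMW row keeps its single 1.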
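One way to define filler, if you have access to all of df.car in advance, is via
filler = '_' + ''.join(df.car.dropna().unique())
because the result is at least one character longer than the longest string in the column, so it cannot equal any of them (the dropna() is needed because join fails on NaN). Another way is to use a random string (this needs import random and import string):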
filler = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))
The probability that this random string already appears in the column is at most len(df) / 26 ** 10.
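If even that small collision probability is a concern, the random string can be checked against the known values and regenerated on a clash. The helper name make_filler and the sample Series below are hypothetical, just a sketch of the idea:

```python
import random
import string

import pandas as pd


def make_filler(values, length=10):
    """Return a random lowercase string that does not occur in `values`.

    Hypothetical helper: it only guards against the values passed in,
    not against values that show up later in unseen data.
    """
    known = set(pd.Series(values).dropna().astype(str))
    while True:
        candidate = "".join(
            random.choice(string.ascii_lowercase) for _ in range(length)
        )
        if candidate not in known:
            return candidate


cars = pd.Series(["Audi", "BMW", "Mazda", None])
filler = make_filler(cars)
```

Since unseen values are dropped by the reindex anyway, a clash with a value that only appears in the test set just means that value and NaN are both encoded as all zeros, which is the behaviour asked for.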
Upvotes: 1