anon_swe

Reputation: 9345

Pandas: Treat NaN as Unseen Value in One-Hot Encoding

I have a training set that I'm using to build some machine learning models and I need to set up some code to predict on a test set (that I don't have access to).

For instance, if I have a DataFrame, train:

    car
0   Audi
1   BMW
2   Mazda

I can use pd.get_dummies to get:

   car_Audi  car_BMW  car_Mazda
0         1        0          0
1         0        1          0
2         0        0          1

Call the resulting DataFrame train_encoded.

Now, suppose my test DataFrame looks like:

    car
0   Mercedes

I can use:

pd.get_dummies(test).reindex(columns=train_encoded.columns, fill_value=0)

to get:

   car_Audi  car_BMW  car_Mazda
0         0        0          0

How can I treat NaN the same as an unseen value in my car column? That is, if I encounter NaN in the car column of test, I want to get back:

   car_Audi  car_BMW  car_Mazda
0         0        0          0

Thanks!

Upvotes: 1

Views: 1266

Answers (1)

Ami Tavory

Reputation: 76297

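If you choose a string, filler, that does not appear in df.car (here df is the test DataFrame), you can use a slight modification of Wen's suggestion in the comments, which also covers the case where 'NAN' is itself a string in df.car. The filler gets its own dummy column, which reindex then drops because it is not among the training columns:

df['car'] = df['car'].fillna(filler)
pd.get_dummies(df).reindex(columns=train_encoded.columns, fill_value=0)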
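
As a minimal end-to-end sketch of this approach: the frames below reuse the question's example data, df stands in for the test DataFrame, and the placeholder "__missing__" is a hypothetical filler that is simply assumed not to occur among the real car names. Passing dtype=int and fill_value=0 is a choice made here so the result holds plain 0/1 integers.

```python
import numpy as np
import pandas as pd

# Training data from the question, one-hot encoded as integers.
train = pd.DataFrame({"car": ["Audi", "BMW", "Mazda"]})
train_encoded = pd.get_dummies(train, dtype=int)

# Test data with an unseen category, a missing value, and a known category.
df = pd.DataFrame({"car": ["Mercedes", np.nan, "BMW"]})

# Hypothetical placeholder; assumed not to be a real value of the car column.
filler = "__missing__"

# Replace NaN with the filler so it is encoded like any other unseen value.
df["car"] = df["car"].fillna(filler)

# Encode, then keep only the training columns; columns absent from the
# test encoding are created and filled with 0.
test_encoded = pd.get_dummies(df, dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)

print(test_encoded)
```

In this sketch the rows for "Mercedes" and for the missing value both come out as all zeros, while the "BMW" row has a 1 only in car_BMW.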

One way to define filler, if you have access to all of df.car in advance, is via

filler = '_' + ''.join(df.car.dropna().unique())  # dropna: NaN is a float and would break join

This works because the result is at least one character longer than the longest string in df.car, so it cannot equal any of them. Another way is to use a random string:

import random, string
filler = ''.join(random.choice(string.ascii_lowercase) for _ in range(10))

By the union bound, the probability that this random string already appears in df.car is at most len(df) / 26 ** 10.
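
Both ways of defining filler can be sketched as follows. The sample data and the helper random_filler are illustrative assumptions, not part of the original suggestion; the helper adds an explicit membership check so that the very unlikely collision is ruled out rather than merely improbable.

```python
import random
import string

import numpy as np
import pandas as pd

# Illustrative column containing a missing value.
df = pd.DataFrame({"car": ["Audi", np.nan, "BMW", "Mazda"]})
seen = set(df["car"].dropna())

# Option 1: '_' plus every distinct value, which is longer than any single value.
concat_filler = "_" + "".join(df["car"].dropna().unique())

# Option 2: a random lowercase string, redrawn if it happens to collide.
def random_filler(existing, length=10):
    """Return a random lowercase string that is not in `existing`."""
    while True:
        candidate = "".join(
            random.choice(string.ascii_lowercase) for _ in range(length)
        )
        if candidate not in existing:
            return candidate

rand_filler = random_filler(seen)

# Upper bound on the chance that a single random draw collides.
collision_bound = len(df) / 26 ** 10
print(concat_filler, rand_filler, collision_bound)
```

The concatenation approach needs the whole column up front, whereas the random string can be drawn before any data is seen, at the cost of a small collision probability unless a check like the one above is added.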

Upvotes: 1
