Reputation: 3607
It was recently brought to my attention that if you have a dataframe df like this:
   A      B   C
0  0   Boat  45
1  1    NaN  12
2  2    Cat   6
3  3  Moose  21
4  4   Boat  43
you can encode the categorical data automatically with pd.get_dummies:
df1 = pd.get_dummies(df)
Which yields this:
   A   C  B_Boat  B_Cat  B_Moose
0  0  45     1.0    0.0      0.0
1  1  12     0.0    0.0      0.0
2  2   6     0.0    1.0      0.0
3  3  21     0.0    0.0      1.0
4  4  43     1.0    0.0      0.0
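For anyone who wants to reproduce this, here is a minimal sketch of the example. The dtype=float argument is an assumption added here so the output matches the 1.0/0.0 values shown; recent pandas versions return boolean indicator columns by default.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame; B is the only string (categorical) column.
df = pd.DataFrame({
    "A": [0, 1, 2, 3, 4],
    "B": ["Boat", np.nan, "Cat", "Moose", "Boat"],
    "C": [45, 12, 6, 21, 43],
})

# dtype=float is an assumption made to match the float output in the post.
df1 = pd.get_dummies(df, dtype=float)

# Numeric columns pass through; B is expanded into one column per value.
# The NaN row gets zeros in every B_* column because dummy_na defaults to False.
print(df1)
```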
I typically use LabelEncoder().fit_transform for this sort of task before putting it in pd.get_dummies, but if I can skip a few steps that'd be desirable.
Am I losing anything by simply using pd.get_dummies on my entire dataframe to encode it?
Upvotes: 5
Views: 2481
Reputation: 40973
Yes, you can skip LabelEncoder if you only want to encode string features. On the other hand, if you have a categorical column of integers (instead of strings), pd.get_dummies will leave it as it is (see your A or C column, for example). In that case you should use OneHotEncoder. Ideally OneHotEncoder would support both integers and strings, but this is being worked on at the moment.
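A rough sketch of the integer case, assuming scikit-learn is installed. The frame below is hypothetical (column A holds integer category codes), and the columns= variant at the end is a pandas-only alternative that is not part of the answer above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "A" holds integer category codes, not quantities.
df = pd.DataFrame({"A": [0, 1, 2, 1], "B": ["Boat", "Cat", "Moose", "Boat"]})

# By default only the string column is expanded; "A" passes through unchanged.
default = pd.get_dummies(df)
print(default.columns.tolist())

# OneHotEncoder treats each distinct integer as its own category.
# fit_transform returns a sparse matrix by default, so densify it to inspect.
encoder = OneHotEncoder()
a_onehot = encoder.fit_transform(df[["A"]]).toarray()
print(a_onehot)

# Alternative (an addition here): name the integer column explicitly,
# which makes get_dummies encode it and leave the other columns alone.
explicit = pd.get_dummies(df, columns=["A"])
print(explicit.columns.tolist())
```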
Upvotes: 7