Reputation: 379
I am trying to apply regression (with XGBRegressor) to the following dataset containing 3 categorical variables.
X_data
severity -> values S1,S2,S3
priority -> values P1,P2,P3
cluster -> values a,b,c,d
y_data -> the labels to predict, which are numerical values
In order to convert all 3 columns to categorical I use:
pd.get_dummies(X_data['thecolumn'], drop_first=True)
After converting all of them I end up with 7 new columns (since I always drop the first column). When applying the algorithm, could a column from priority or cluster be misinterpreted as the third column of severity? Maybe I don't understand the concept, but I can't see how the reference is kept and I'm afraid I'm not doing it right.
Upvotes: 1
Views: 1960
Reputation: 13401
Nope. A column from priority or cluster won't be misinterpreted as the third column of severity.
Here's how the reference is kept:
pandas.get_dummies has a parameter, drop_first, that controls whether the reference level is kept or removed (that is, whether you get k or k-1 dummies out of k categorical levels).
Please note that the default is drop_first=False, meaning that the reference is not dropped and k dummies are created out of k categorical levels. If you set drop_first=True, the reference column (the first level) is dropped after encoding.
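To make the k versus k-1 difference concrete, here is a minimal sketch. The sample values and the prefix="severity" argument are assumptions added for illustration; they are not part of the original data.

```python
import pandas as pd

# Hypothetical sample column mirroring the severity values from the question.
severity = pd.Series(["S1", "S2", "S3", "S1"], name="severity")

# drop_first=False (the default): one dummy per level, k = 3 columns.
full = pd.get_dummies(severity, prefix="severity", drop_first=False)

# drop_first=True: the first level (S1) becomes the implicit reference, k-1 = 2 columns.
reduced = pd.get_dummies(severity, prefix="severity", drop_first=True)

full_cols = list(full.columns)
reduced_cols = list(reduced.columns)
print(full_cols)
print(reduced_cols)
```

The dropped level is the first one in sorted order, so S1 is the reference here.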
Here's link to one hot encoding.
In your case, severity has 3 categories: S1, S2 and S3.
After creating dummies, exactly one of these columns will be 1 and the others 0 in each row:
S1 becomes [1,0,0], S2 becomes [0,1,0] and S3 becomes [0,0,1].
Now, if you drop the column for category S1, the values will be:
[0,0] if severity is S1
[1,0] if severity is S2
[0,1] if severity is S3
So there is no information loss here and your model has one less column to deal with.
That's why it is generally recommended to set the drop_first parameter to True.
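One way to check the "no information loss" claim is to decode the reduced columns back into the original labels. This is only a sketch: the decode helper and the sample values are hypothetical, written for this example.

```python
import pandas as pd

# Hypothetical sample values; S1 is the level that drop_first=True removes.
severity = pd.Series(["S1", "S2", "S3", "S2"], name="severity")
dummies = pd.get_dummies(severity, prefix="severity", drop_first=True).astype(int)

def decode(row):
    """Recover the original label from the two remaining dummy columns."""
    if row["severity_S2"] == 1:
        return "S2"
    if row["severity_S3"] == 1:
        return "S3"
    # Both columns are 0, so the row must belong to the dropped reference level.
    return "S1"

recovered = dummies.apply(decode, axis=1)
print(recovered.tolist() == severity.tolist())
```

Because an all-zero row can only mean the reference level, every original value can be recovered from the k-1 columns.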
Edit :
After applying the dummies you will get columns like:
severity_S1  severity_S2  severity_S3
     1            0            0       # when value is S1
     0            1            0       # when value is S2
     0            0            1       # when value is S3
With drop_first=True, pandas.get_dummies() drops the first column after creating the dummies above.
So your data will look like this:
severity_S2  severity_S3
     0            0       # when value is S1
     1            0       # when value is S2
     0            1       # when value is S3
For all three variables, your final data will look like the table below (I'm using short column names to save space):
s2  s3  p2  p3  B  C  D
0   0   1   0   1  0  0   # For row with S1, P2 and B
0   1   0   1   0  1  0   # For row with S3, P3 and C
1   0   0   0   0  0  1   # For row with S2, P1 and D
1   0   0   0   0  0  0   # For row with S2, P1 and A
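The same result can be produced in one call by passing the whole frame together with the columns argument, which also prefixes each dummy with its source column name. A minimal sketch, assuming a small hypothetical X_data with the four rows shown above:

```python
import pandas as pd

# Hypothetical rows matching the four examples in the table.
X_data = pd.DataFrame({
    "severity": ["S1", "S3", "S2", "S2"],
    "priority": ["P2", "P3", "P1", "P1"],
    "cluster":  ["b", "c", "d", "a"],
})

# Encode all three columns at once; each dummy is named <column>_<level>,
# so a priority or cluster column cannot be confused with a severity one.
encoded = pd.get_dummies(
    X_data,
    columns=["severity", "priority", "cluster"],
    drop_first=True,
).astype(int)

print(encoded)
```

Each of the 7 columns carries its own name (for example severity_S3 or cluster_b), and the model treats every column as a separate feature, so there is no way for one variable's dummy to stand in for another's.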
Upvotes: 1