Battalgazi

Reputation: 379

Convert different categorical variables to dummy variables

I am trying to apply regression (with XGBRegressor) to the following dataset containing 3 categorical variables.

X_data

severity -> values S1,S2,S3
priority -> values P1,P2,P3
cluster -> values a,b,c,d

y_data: the labels to predict, which are numerical values

To convert each of the 3 columns to dummy variables I use:

pd.get_dummies(X_data['thecolumn'],drop_first =True)

After converting all of them I end up with 7 new columns (since I always drop the first column). When applying the algorithm, could a column from priority or cluster be misinterpreted as the third column of severity? Maybe I don't understand the concept, but I can't see how the reference is kept and I'm afraid I'm not doing it right.

Upvotes: 1

Views: 1960

Answers (1)

Sociopath

Reputation: 13401

Nope. The column from priority or cluster won't be misinterpreted as the third column of severity.

Here's how the reference is kept:

pandas.get_dummies has a parameter, drop_first, that controls whether the reference level is kept or removed (that is, whether you get k or k-1 dummies out of k categorical levels).

Note that drop_first=False means the reference is not dropped, so k dummies are created out of k categorical levels. If you set drop_first=True, the reference column is dropped after encoding.

Here's link to one hot encoding.

In your case severity has 3 categories: S1, S2 and S3. After creating the dummies, exactly one of the three dummy columns will be 1 in each row and the others will be 0.

For S1 it will be [1,0,0], for S2 [0,1,0] and for S3 [0,0,1].

Now suppose you drop the column for category S1.

The values will be [0,0] if severity is S1

[1,0] if severity is S2

[0,1] if severity is S3.

So no information is lost, because S1 is still identified by both remaining columns being 0, and your model has one less column to deal with. That's why setting drop_first=True is commonly recommended.
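As a rough sketch of the point above, the snippet below encodes a small made-up severity column with and without drop_first. The sample values and the prefix="severity" argument are assumptions for illustration, not taken from the original data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
severity = pd.Series(["S1", "S2", "S3", "S2"], name="severity")

# k dummies for k levels (reference kept)
full = pd.get_dummies(severity, prefix="severity")

# k-1 dummies: the first level (S1) becomes the implicit reference
reduced = pd.get_dummies(severity, prefix="severity", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))

# Rows where every remaining dummy is 0 are exactly the reference (S1) rows
is_reference = reduced.sum(axis=1) == 0
print(severity[is_reference].tolist())
```

The dropped level can still be recovered from the remaining columns, which is why nothing is lost.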

Edit :

After applying the dummies you will get columns like:

severity_S1   severity_S2   severity_S3  

  1              0              0                  # when value is S1
  0              1              0                  # when value is S2  
  0              0              1                  # when value is S3

With drop_first=True, pandas.get_dummies() drops the first column after creating the dummies above. So your data will look like this:

 severity_S2   severity_S3

   0              0                  # when value is S1
   1              0                  # when value is S2  
   0              1                  # when value is S3

For all three variables your final data will look like below (I'm using short column names to save space):

s2  s3  p2  p3  b  c  d
0   0   1   0   1  0  0     # For row with S1, P2 and b
0   1   0   1   0  1  0     # For row with S3, P3 and c
1   0   0   0   0  0  1     # For row with S2, P1 and d
1   0   0   0   0  0  0     # For row with S2, P1 and a
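The table above can be reproduced with a short sketch like the one below. The four rows are made up to match the table. Passing the whole DataFrame with the columns argument (rather than one column at a time, as in the question) is an assumption about how you might prefer to call it. Each dummy column is prefixed with its source column name, so a priority or cluster column cannot be confused with a severity column.

```python
import pandas as pd

# Hypothetical rows matching the table above
X_data = pd.DataFrame({
    "severity": ["S1", "S3", "S2", "S2"],
    "priority": ["P2", "P3", "P1", "P1"],
    "cluster":  ["b", "c", "d", "a"],
})

# Encode all three columns at once; names become <column>_<level>
X_encoded = pd.get_dummies(
    X_data,
    columns=["severity", "priority", "cluster"],
    drop_first=True,
)

print(list(X_encoded.columns))
print(X_encoded.astype(int))
```

Note that get_dummies only creates columns for levels that actually occur in the data it is given, so the training and prediction data should be encoded consistently.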

Upvotes: 1
