venkysmarty

Reputation: 11431

get_dummies usage in pandas

I am reading a book on introduction to machine learning using Python. The authors give the following example: let's say that for the workclass feature we have the possible values "Government Employee", "Private Employee", "Self Employed" and "Self Employed Incorporated".

print("Original features:\n", list(data.columns), "\n")

data_dummies = pd.get_dummies(data)

print("Features after get_dummies:\n", list(data_dummies.columns))

Original features:
['age', 'workclass']

Features after get_dummies:
['age', 'workclass_ ?', 'workclass_ Government Employee', 'workclass_Private Employee', 'workclass_Self Employed', 'workclass_Self Employed Incorporated']

My question: where does the new column workclass_ ? come from?

Upvotes: 2

Views: 522

Answers (1)

jezrael

Reputation: 862481

The column is created from the string values in the workclass column. '?' is one of those values, so it gets its own dummy column just like the others:

data = pd.DataFrame({'age':[1,1,1,2,1,1],
                   'workclass':['Government Employee','Private Employee','Self Employed','Self Employed Incorpora ted','Self Employed Incorpora ted','?']})

print (data)
   age                    workclass
0    1          Government Employee
1    1             Private Employee
2    1                Self Employed
3    2  Self Employed Incorpora ted
4    1  Self Employed Incorpora ted
5    1                            ?

data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1   
1    1            0                              0   
2    1            0                              0   
3    2            0                              0   
4    1            0                              0   
5    1            1                              0   

   workclass_Private Employee  workclass_Self Employed  \
0                           0                        0   
1                           1                        0   
2                           0                        1   
3                           0                        0   
4                           0                        0   
5                           0                        0   

   workclass_Self Employed Incorpora ted  
0                                      0  
1                                      0  
2                                      0  
3                                      1  
4                                      1  
5                                      0  
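If the '?' is meant as a placeholder for an unknown value, one option is to convert it to a missing value before encoding, since get_dummies creates no column for NaN unless dummy_na=True is passed. The following is a minimal sketch under assumptions: the sample data is made up, and the leading space in the raw strings is only a guess based on the name 'workclass_ ?' in the question.

```python
import pandas as pd

# Hypothetical sample: '?' marks an unknown workclass, and the raw strings
# carry a leading space (an assumption based on 'workclass_ ?').
data = pd.DataFrame({'age': [1, 1, 2, 1],
                     'workclass': [' Government Employee', ' Private Employee',
                                   ' Self Employed', ' ?']})

# Strip surrounding whitespace, then turn the '?' placeholder into NaN.
workclass = data['workclass'].str.strip()
data['workclass'] = workclass.mask(workclass == '?')

# With the default dummy_na=False, no dummy column is created for NaN.
data_dummies = pd.get_dummies(data)
print(list(data_dummies.columns))
```

The row that held '?' then has no workclass dummy set at all. Whether that is preferable to keeping a separate 'workclass_?' column depends on the model and the data.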

And if multiple columns contain the same values, this prefix is really helpful, because it shows which column each dummy came from:

data = pd.DataFrame({'age':[1,1,3],
                   'workclass':['Government Employee','Private Employee','?'],
                   'workclass1':['Government Employee','Private Employee','Self Employed']})

print (data)
   age            workclass           workclass1
0    1  Government Employee  Government Employee
1    1     Private Employee     Private Employee
2    3                    ?        Self Employed

data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1   
1    1            0                              0   
2    3            1                              0   

   workclass_Private Employee  workclass1_Government Employee  \
0                           0                               1   
1                           1                               0   
2                           0                               0   

   workclass1_Private Employee  workclass1_Self Employed  
0                            0                         0  
1                            1                         0  
2                            0                         1  
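The prefix does not have to be the column name. As a small sketch, the prefix parameter also accepts a dict that maps column names to prefixes. The short prefixes 'wc' and 'wc1' below are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})

# Map each encoded column to its own (hypothetical) prefix.
data_dummies = pd.get_dummies(data, prefix={'workclass': 'wc', 'workclass1': 'wc1'})
print(list(data_dummies.columns))
```

The dummies from the two source columns stay distinguishable, only with shorter names.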

If you don't need the prefix, you can pass parameters to replace it with an empty string:

data_dummies = pd.get_dummies(data, prefix='', prefix_sep='')
print (data_dummies)
   age  ?  Government Employee  Private Employee  Government Employee  \
0    1  0                    1                 0                    1   
1    1  0                    0                 1                    0   
2    3  1                    0                 0                    0   

   Private Employee  Self Employed  
0                 0              0  
1                 1              0  
2                 0              1  

Then you can group by column name and aggregate with max, so that duplicated dummy columns are combined into one column per unique name:

print (data_dummies.groupby(level=0, axis=1).max())
   ?  Government Employee  Private Employee  Self Employed  age
0  0                    1                 0              0    1
1  0                    0                 1              0    1
2  1                    0                 0              1    3
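Note that in recent pandas releases (2.1 and later, as far as I know) grouping with axis=1 is deprecated. A possible alternative, sketched here rather than taken from the original answer, is to transpose, group on the index, and transpose back. Passing dtype=int to get_dummies is an extra assumption so that every column has the same integer dtype before the transpose.

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})

# dtype=int keeps the dummies as 0/1 integers, matching the 'age' column.
data_dummies = pd.get_dummies(data, prefix='', prefix_sep='', dtype=int)

# Transpose so the duplicated column names become index labels,
# take the max per label, then transpose back.
combined = data_dummies.T.groupby(level=0).max().T
print(combined)
```

The result should have one column per unique name, the same layout as the output above.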

Upvotes: 3
