Reputation: 11431
I am reading a book on introduction to machine learning using Python. The authors describe the following example: let's say that for the workclass feature we have the possible values "Government Employee", "Private Employee", "Self Employed" and "Self Employed Incorporated". The code and its output are:
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))
Original features:
['age', 'workclass']
Features after get_dummies:
['age', 'workclass_ ?', 'workclass_ Government Employee', 'workclass_Private Employee', 'workclass_Self Employed', 'workclass_Self Employed Incorporated']
My question: what is the new column named workclass_ ? in the output?
Upvotes: 2
Views: 522
Reputation: 862481
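It is created from the string values in the workclass column: get_dummies adds one indicator column for each distinct string, and '?' is one of those strings. For example: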
import pandas as pd

data = pd.DataFrame({'age':[1,1,1,2,1,1],
                     'workclass':['Government Employee','Private Employee','Self Employed','Self Employed Incorpora ted','Self Employed Incorpora ted','?']})
print (data)
   age                    workclass
0    1          Government Employee
1    1             Private Employee
2    1                Self Employed
3    2  Self Employed Incorpora ted
4    1  Self Employed Incorpora ted
5    1                            ?
data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1
1    1            0                              0
2    1            0                              0
3    2            0                              0
4    1            0                              0
5    1            1                              0

   workclass_Private Employee  workclass_Self Employed  \
0                           0                        0
1                           1                        0
2                           0                        1
3                           0                        0
4                           0                        0
5                           0                        0

   workclass_Self Employed Incorpora ted
0                                      0
1                                      0
2                                      0
3                                      1
4                                      1
5                                      0
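The column in the question is named workclass_ ? (with a space), which suggests that the values in that data carry a leading space. If the '?' is a placeholder for a missing value (an assumption about that dataset, not something shown above), one option is to strip the whitespace and turn '?' into NaN before encoding, because get_dummies skips NaN by default. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up sample rows; the leading spaces and the '?' placeholder are assumptions.
data = pd.DataFrame({
    "age": [1, 1, 2],
    "workclass": [" Private Employee", " ?", " Self Employed"],
})

# Strip surrounding whitespace, then mark the placeholder as missing.
workclass = data["workclass"].str.strip().replace("?", np.nan)
cleaned = data.assign(workclass=workclass)

# NaN gets no dummy column because dummy_na defaults to False.
dummies = pd.get_dummies(cleaned)
print(list(dummies.columns))
```

Passing dummy_na=True instead would keep an explicit indicator column for the missing values.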
If several columns contain the same values, this prefix is really helpful, because it shows which source column each dummy column came from:
data = pd.DataFrame({'age':[1,1,3],
'workclass':['Government Employee','Private Employee','?'],
'workclass1':['Government Employee','Private Employee','Self Employed']})
print (data)
   age            workclass           workclass1
0    1  Government Employee  Government Employee
1    1     Private Employee     Private Employee
2    3                    ?        Self Employed
data_dummies = pd.get_dummies(data)
print (data_dummies)
   age  workclass_?  workclass_Government Employee  \
0    1            0                              1
1    1            0                              0
2    3            1                              0

   workclass_Private Employee  workclass1_Government Employee  \
0                           0                               1
1                           1                               0
2                           0                               0

   workclass1_Private Employee  workclass1_Self Employed
0                            0                         0
1                            1                         0
2                            0                         1
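The prefix does not have to be the full column name. As a small sketch, the prefix parameter of get_dummies also accepts a dict that maps column names to prefixes; the short names 'wc' and 'wc1' below are made up for illustration:

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})

# Map each encoded column to a shorter, hypothetical prefix.
short = pd.get_dummies(data, prefix={'workclass': 'wc', 'workclass1': 'wc1'})
print(list(short.columns))
```

The dummy columns stay distinguishable per source column, only with shorter names.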
If you don't need the prefix, you can pass the prefix and prefix_sep parameters to replace it with an empty string:
data_dummies = pd.get_dummies(data, prefix='', prefix_sep='')
print (data_dummies)
   age  ?  Government Employee  Private Employee  Government Employee  \
0    1  0                    1                 0                    1
1    1  0                    0                 1                    0
2    3  1                    0                 0                    0

   Private Employee  Self Employed
0                 0              0
1                 1              0
2                 0              1
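Note that dropping the prefix leaves duplicate column names in the result. A quick sketch of how to list them, and of what selecting such a name returns:

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})
data_dummies = pd.get_dummies(data, prefix='', prefix_sep='')

# Names that occur more than once (duplicated() flags the repeat occurrences).
dupes = data_dummies.columns[data_dummies.columns.duplicated()].tolist()
print(dupes)

# Selecting a duplicated name returns a DataFrame with both columns, not a Series.
selected = data_dummies['Government Employee']
print(type(selected).__name__, selected.shape)
```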
The duplicate column names can then be collapsed: group by column name and aggregate with max to get one dummy column per unique name:
print (data_dummies.groupby(level=0, axis=1).max())
   ?  Government Employee  Private Employee  Self Employed  age
0  0                    1                 0              0    1
1  0                    0                 1              0    1
2  1                    0                 0              1    3
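As far as I know, recent pandas releases deprecate the axis=1 argument of groupby, so the call above may warn or fail there. A possible alternative, sketched under that assumption, is to encode only the string columns as integers, transpose, group the rows by label, and transpose back:

```python
import pandas as pd

data = pd.DataFrame({'age': [1, 1, 3],
                     'workclass': ['Government Employee', 'Private Employee', '?'],
                     'workclass1': ['Government Employee', 'Private Employee', 'Self Employed']})

# Encode only the string columns; dtype=int keeps 0/1 values through the transpose.
dummies = pd.get_dummies(data[['workclass', 'workclass1']],
                         prefix='', prefix_sep='', dtype=int)

# Transpose so the duplicate names become row labels, take the max per label, transpose back.
merged = dummies.T.groupby(level=0).max().T

# Put the untouched numeric column back next to the merged dummies.
result = data[['age']].join(merged)
print(result)
```

Keeping age out of the transpose avoids mixing its values with the indicator columns.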
Upvotes: 3