I have this column
df["Pclass"].tail()
Pclass
2
1
3
1
3
I created dummies of the column
dummies = pd.get_dummies(df["Pclass"],prefix="Pclass")
df = pd.concat([df,dummies],axis=1)
Result:
df[["Pclass_1", "Pclass_2", "Pclass_3"]].tail()
Pclass_1 Pclass_2 Pclass_3
886 0 1 0
887 1 0 0
888 0 0 1
889 1 0 0
890 0 0 1
I don't quite get it. By which rules are the numbers in the column transformed into 1s and 0s?
Upvotes: 0
Views: 1579
Reputation: 2110
Predictive models that depend on numeric inputs cannot directly handle open text fields or categorical attributes. Instead, these information-rich data need to be processed prior to presenting the information to a model. Tree-based and Naive Bayes models are exceptions; most models require that the predictors take numeric form.
Creating dummy variables for unordered categories is one approach to transforming categorical attributes into numeric ones. @Erfan has answered what dummy variables do: an unordered predictor with C categories can be represented by C−1 binary dummy variables or a hashed version of binary dummy variables. These methods effectively present the categorical information to the models.
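To make the C−1 representation concrete, here is a minimal sketch using the drop_first option of pd.get_dummies. The five-row frame is made up to mirror the Pclass values from the question, and dtype=int is passed only so the output shows 1/0 on recent pandas versions.

```python
import pandas as pd

# Toy data mirroring the Pclass values shown in the question (an assumption).
df = pd.DataFrame({"Pclass": [2, 1, 3, 1, 3]})

# Full encoding: one column per category, so C = 3 columns.
full = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=int)

# Reduced encoding: the first category is dropped, leaving C - 1 = 2 columns.
# A row with zeros in both columns then stands for the dropped category (Pclass 1).
reduced = pd.get_dummies(df["Pclass"], prefix="Pclass", drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

No information is lost in the reduced form, because the dropped category is implied whenever all remaining dummy columns are 0.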
But now suppose that the C categories have a relative ordering. For example, consider a predictor that has the categories of “low”, “medium”, and “high.” Creating dummy attributes as done for Unordered Data would miss the information contained in the relative ordering.
For ordered data encoding:
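One simple option, sketched here under the assumption that plain integer codes are acceptable for the model, is an ordered pandas Categorical. The column name level and the sample values are hypothetical and only illustrate the "low" / "medium" / "high" case above.

```python
import pandas as pd

# Hypothetical ordered predictor with the three categories from the text.
df = pd.DataFrame({"level": ["low", "high", "medium", "low"]})

# Declare the ordering explicitly; otherwise pandas sorts categories alphabetically.
order = ["low", "medium", "high"]
df["level"] = pd.Categorical(df["level"], categories=order, ordered=True)

# Integer codes follow the declared order: low -> 0, medium -> 1, high -> 2.
df["level_code"] = df["level"].cat.codes

print(df)
```

Integer codes keep the ordering but also imply equal spacing between categories, which may or may not suit the data.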
Upvotes: 2
Reputation: 42916
pd.get_dummies basically pivots each unique value of the categorical column into its own column and sets a flag (1 or 0) to mark which categorical value was present on that row.
Let's look at a less abstract example:
df = pd.DataFrame({'sex':['male', 'female', 'unknown', 'female']})
sex
0 male
1 female
2 unknown
3 female
# dtype=int keeps the 1/0 flags; pandas >= 2.0 returns True/False by default
df.join(pd.get_dummies(df['sex'], prefix='sex', dtype=int))
sex sex_female sex_male sex_unknown
0 male 0 1 0
1 female 1 0 0
2 unknown 0 0 1
3 female 1 0 0
As you can see, the first row in our original column is male, and in our dummies column sex_male we see that there's a flag 1:
sex sex_female sex_male sex_unknown
0 male 0 1 0
Then on the second row, the value in our original column is female, and we see that our dummies column sex_female has flag 1:
sex sex_female sex_male sex_unknown
1 female 1 0 0
And so on.
What's also important to remember when you apply pd.get_dummies:
number of new dummy columns = number of unique values in the original categorical column
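That column count can be checked directly. The following is a small sketch reusing the sex example from above. It assumes there are no missing values, since both nunique and get_dummies skip NaN by default.

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male', 'female', 'unknown', 'female']})
dummies = pd.get_dummies(df['sex'], prefix='sex', dtype=int)

n_unique = df['sex'].nunique()   # distinct values in the original column
n_columns = dummies.shape[1]     # dummy columns that were created

print(n_unique, n_columns)

# Every row has exactly one flag set, because each row holds exactly one category.
print(dummies.sum(axis=1).tolist())
```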
In machine learning terms, we call this one-hot encoding. With scikit-learn it would look as follows:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit_transform(df['sex'].to_numpy().reshape(-1,1)).toarray()
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.]])
Upvotes: 4
Reputation: 2182
It makes a dummy column for each value that appeared in the original column, and then for each row puts a 1 if that row had the value corresponding to the dummy column and a 0 otherwise.
Row 886 had a 2 in column Pclass, so that is converted to a 1 in column Pclass_2 and a 0 in all other dummy columns.
Row 887 had a 1 in column Pclass, so that is converted to a 1 in column Pclass_1 and a 0 in all other dummy columns.
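The two rows above can be reproduced with a small stand-in frame. This is only a sketch: the five Pclass values and the index 886 to 890 are copied from the question, not loaded from the real dataset.

```python
import pandas as pd

# Stand-in for the last five rows of the question's frame.
df = pd.DataFrame({"Pclass": [2, 1, 3, 1, 3]}, index=range(886, 891))

dummies = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=int)

# Look up the two rows discussed above.
row_886 = dummies.loc[886].to_dict()
row_887 = dummies.loc[887].to_dict()

print(row_886)
print(row_887)
```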
Upvotes: 1