Reputation: 2527
I have a dataset of nominal and numerical features. I want to be able to represent this dataset entirely numerically if possible.
Ideally I would be able to do this for an n-ary nominal feature. I realize that in the binary case, one could represent the two nominal values with integers. However, when a nominal feature can have many permutations, how would this be possible, if at all?
Upvotes: 2
Views: 903
Reputation: 81
If you use pandas, you can use a function called .get_dummies()
on your nominal value column. This will turn the column of N
unique values into N
(or if you want N-1
, called drop_first
) new columns indicating with either a 1
or a 0
if a value is present.
Example:
s = pd.Series(list('abca'))
get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Upvotes: 0
Reputation: 77454
There are a number of techniques to "embed" categorical attributes as numbers.
For example, given a categorical variable that can take the values red
, green
and blue
, we can trivially encode this as three attributes isRed={0,1}
, isGreen={0,1}
and isBlue={0,1}
.
While this is popular, and will obviously "work", many people fall for the fallacy of assuming that afterwards numerical processing techniques will produce sensible results.
If you run e.g. k-means on a dataset encoded this way, the result will likely not be too meaningful afterwards. In particular, if you get a mean such as isRed=.3 isGreen=.2 isBlue=.5
- you cannot reasonably map this back to the original data. Worse, with some algorithms you may even get isRed=0 isGreen=0 isBlue=0
.
I suggest that you try to work on your actual data, and avoid encoding as much as possible. If you have a good tool, it will allow you to use mixed data types. Don't try to make everything a numerical vector. This mathematical view of data is quite limited and the data will not give you all the mathematical assumptions that you need to benefit from this view (e.g. metric spaces).
Upvotes: 2
Reputation: 610
Don't do this: I'm trying to encode certain nominal attributes as integers.
Except if there is only two permutations for a nominal feature. It is ok to use any different integers (for example 1 and 3) for each.
But if there is more than two permutations, integers can not be used. Lets say we assigned 1, 2 and 3 to three permutations. As we can see, there is higher relation between 1-2 and 2-3 than 1-3 because of differences.
Rather, use a separate binary feature for each value of each nominal attribute. Thus, the answer of your question: It is not possible/wisely.
Upvotes: 1