Reputation: 111
I am unable to find a way to detect outliers in categorical data. Each row of my data is a combination of categorical values, and I want to mark as outliers the rows whose combinations differ from the rest.
I cannot simply cluster the data by frequency, because a non-outlier row and an outlier row can have the same frequency.
My data looks something like this:
| | c1 | c2 | c3 | c4 |
|---|---|---|---|---|
| row1 | A | B | C | D |
| row2 | A | B | C | D |
| row3 | A | D | C | G |
| row4 | NU | D | E | G |
| row6 | NU | D | E | X |
Please suggest a valid logic to solve the issue. I also tried to distribute the data based on frequency but I'm unable to assign a threshold as I'm unable to find a value to consider the data as outliers. Providing a way to find thresholds also can help.
Upvotes: 3
Views: 4002
Reputation: 1761
According to the tags you assigned, I guess you want to perform one-hot encoding in a later step. In this case you can use sklearn's `OneHotEncoder` and specify the `min_frequency` parameter. If you specify `min_frequency`, rare categorical values are grouped into a single category named 'infrequent_sklearn'.
Upvotes: 0
Reputation: 76
There are no outlier detection methods for categorical data. The notion means nothing in this case. You might think of it like this:
Suppose you have a sample of 10 with 9 females and 1 male. You might think the male is the outlier, but it's just the composition of your sample, not an outlier.
For an outlier to exist there must be a measure of distance between the items. Have a look at this for more information.
> Please suggest a valid logic to solve the issue. I also tried to distribute the data based on frequency but I'm unable to assign a threshold as I'm unable to find a value to consider the data as outliers. Providing a way to find thresholds also can help.
A solution could be to just call `value_counts` on your column, so that you have the frequency of each element.
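A rough sketch of the frequency idea with pandas, using the table from the question. The cut-off `min_count = 2` is purely a hypothetical threshold for illustration; choosing a sensible value still depends on your data.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame(
    {
        "c1": ["A", "A", "A", "NU", "NU"],
        "c2": ["B", "B", "D", "D", "D"],
        "c3": ["C", "C", "C", "E", "E"],
        "c4": ["D", "D", "G", "G", "X"],
    },
    index=["row1", "row2", "row3", "row4", "row6"],
)

# Frequency of each value within a single column.
c4_counts = df["c4"].value_counts()

# Frequency of each full row combination, aligned back to the rows.
row_freq = df.groupby(list(df.columns))["c1"].transform("size")

# Hypothetical threshold: combinations seen fewer than min_count times.
min_count = 2
rare_rows = df[row_freq < min_count]

print(c4_counts)
print(rare_rows)
```

On this small sample the value X in c4 occurs once, and rows row3, row4 and row6 each hold a combination that occurs only once, so they fall below the assumed threshold.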
Upvotes: 1