OAK

Reputation: 3166

python grouping similar categorical values

My dataset has one column (object dtype) with a large number of unique values. I believe some of them are insignificant because they are sparse, so I want to group the levels whose frequency falls below a defined threshold. I converted the column into categorical values with the label encoder module, and now I want to combine the categories whose count is at or below a specified threshold. In the sample set I prepared, if the total count (freq) of a class in the 'bin' column is less than or equal to 2, it receives the value 'o' in the 'new_bins' column instead. So bins 'c' and 'd' become 'o' in 'new_bins'.

    id      bin     new_bins
    ========================
    1       a       a
    2       a       a
    3       b       b
    4       c       o
    5       b       b
    6       a       a
    7       b       b
    8       a       a
    9       c       o
    10      a       a
    11      d       o
    12      d       o

df.groupby(['bin'], sort=True).count()

This is one line of code I tried, but it doesn't achieve what I want. I know this is kind of fuzzy, as I have no other code. I thought this problem was called binning, but perhaps it's called something else; I cannot seem to find a similar example. In a Kaggle competition it was referred to as combining levels. Even just naming the term or phrase I should be searching for would help.

Upvotes: 1

Views: 168

Answers (1)

Abbas

Reputation: 4069

This should help:

In [127]: df
Out[127]:
   id bin new_bins
0   1   a        a
1   2   a        a
2   3   b        o
3   4   c        o
4   5   b        o
5   6   a        a
6   7   b        o
7   8   a        a
8   9   c        o
9  10   a        a

Group the items:

In [128]: dfg = df.groupby('bin').count()

In [129]: dfg
Out[129]:
     id  new_bins
bin
a     5         5
b     3         3
c     2         2

Select the groups that meet the condition:

In [130]: dfg[dfg['id'] > 2]
Out[130]:
     id  new_bins
bin
a     5         5
b     3         3

In [143]: val = dfg[dfg['id'] <= 2]

In [144]: val
Out[144]:
     id  new_bins
bin
c     2         2
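
As a side note, the rare labels can also be collected without building the full grouped frame. The following is only a sketch: it assumes the sample data from the question's table, and the names `threshold` and `rare` are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question's table (ids 1..12).
df = pd.DataFrame({
    "id": range(1, 13),
    "bin": list("aabcbabacadd"),
})

threshold = 2  # assumed cutoff: categories seen this many times or fewer are rare

counts = df["bin"].value_counts()          # occurrences per category, as a Series
rare = counts[counts <= threshold].index   # labels of the rare categories

print(sorted(rare))
```

`rare` is an index of labels, so it can be passed to `Series.isin` to build a row mask for the original frame.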

To change the values in a column for the rows that meet the condition:

In [147]: df.loc[df['bin'].isin(val.index), 'new_bins'] = 'MOD'

In [148]: df
Out[148]:
   id bin new_bins
0   1   a        a
1   2   a        a
2   3   b        o
3   4   c      MOD
4   5   b        o
5   6   a        a
6   7   b        o
7   8   a        a
8   9   c      MOD
9  10   a        a
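
To reproduce the table from the question in one pass, one possible approach is to attach each row's group size with `transform` and then replace labels conditionally. This is a minimal sketch under the assumption that the threshold is 2 and the replacement label is 'o', as in the question; `group_size` and `threshold` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question's table (ids 1..12).
df = pd.DataFrame({"id": range(1, 13), "bin": list("aabcbabacadd")})
threshold = 2  # assumed cutoff from the question

# Size of the group each row belongs to, aligned with df's index.
group_size = df.groupby("bin")["bin"].transform("size")

# Keep the label where the group is large enough, otherwise use 'o'.
df["new_bins"] = df["bin"].where(group_size > threshold, "o")

print(df)
```

Because `transform` returns a result with the same index as `df`, no merge or loop over the rare labels is needed.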

Upvotes: 1
