How to one-hot-encode values in a column while treating some values as the same category

Question

I want to one-hot-encode a column in a pandas DataFrame. Some values in that column occur rarely, so I would like to treat them as a single category. Is there a way to do this with OneHotEncoder or get_dummies? One approach I came up with is to replace those values using a dict before encoding. Any suggestions would be highly appreciated.

jezrael · Accepted Answer

You can use the following approach, shown here on sample data:

import pandas as pd

df = pd.DataFrame({'A':[1,2,3,4,5,6,6,5,4]}).astype(str)
print (df)
   A
0  1
1  2
2  3
3  4
4  5
5  6
6  6
7  5
8  4

First, use value_counts with boolean indexing to find all values that occur fewer times than the threshold, then build a dict comprehension that maps each of them to the same scalar value, such as 0. Finally, apply replace:

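thresh = 2
# count how often each value occurs
s = df['A'].value_counts()
# map every value seen fewer than thresh times to the shared label 0
d = {x:0 for x in s[s < thresh].index}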
print (d)
{'1': 0, '3': 0, '2': 0}

df = df.replace(d)
print (df)
   A
0  0
1  0
2  0
3  4
4  5
5  6
6  6
7  5
8  4

print (pd.get_dummies(df, prefix='', prefix_sep=''))
   0  4  5  6
0  1  0  0  0
1  1  0  0  0
2  1  0  0  0
3  0  1  0  0
4  0  0  1  0
5  0  0  0  1
6  0  0  0  1
7  0  0  1  0
8  0  1  0  0
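The same idea can be wrapped in a small helper. This is only a sketch, not part of the original answer: the function name one_hot_with_rare_bucket, the min_count parameter and the "other" label are made up for illustration. It uses Series.where instead of replace, and passes dtype=int to get_dummies because newer pandas versions return boolean columns by default.

```python
import pandas as pd


def one_hot_with_rare_bucket(series, min_count=2, other_label="other"):
    """One-hot-encode `series`, merging values seen fewer than `min_count` times.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    # Map every row to the number of times its value occurs in the column.
    counts = series.map(series.value_counts())
    # Keep frequent values; replace rare ones with a single shared label.
    grouped = series.where(counts >= min_count, other_label)
    # dtype=int yields 0/1 columns instead of booleans.
    return pd.get_dummies(grouped, dtype=int)


df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6, 6, 5, 4]}).astype(str)
encoded = one_hot_with_rare_bucket(df["A"], min_count=2)
print(encoded)
```

Using a string label such as "other" keeps the column dtype uniform, whereas replacing with the integer 0 mixes integers into a string column. Pick a label that cannot collide with a real value in the data.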
