Khaned

Reputation: 443

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column of country names:

   id  Country
   1   RUSSIA     
   2   USA     
   3   RUSSIA   
   4   RUSSIA   
   5   INDIA   
   6   USA   
   7   USA   
   8   ITALY   
   9   USA   
   10  RUSSIA   

I want to replace INDIA and ITALY with "Miscellanous" because each accounts for less than 15% of the rows in the column.

My fallback solution is to replace the names with their frequencies using:

df.column_name = df.column_name.map(df.column_name.value_counts())
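For reference, a minimal sketch that rebuilds the sample data above and shows what this frequency mapping produces. It assumes `id` is an ordinary column and that the column is named `Country`; the names `counts`, `freq`, and `encoded` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question; `id` as a regular column is an assumption.
df = pd.DataFrame({
    "id": range(1, 11),
    "Country": ["RUSSIA", "USA", "RUSSIA", "RUSSIA", "INDIA",
                "USA", "USA", "ITALY", "USA", "RUSSIA"],
})

# Absolute counts per country, as in the one-liner above.
counts = df["Country"].value_counts()

# Relative frequencies (fractions of all rows), which is what a 15% cutoff compares against.
freq = df["Country"].value_counts(normalize=True)

# Frequency encoding: every row receives the share of its own country.
encoded = df["Country"].map(freq)
```

On this sample, RUSSIA and USA each cover 4 of 10 rows, while INDIA and ITALY each cover 1 of 10, so only the latter two fall below 0.15.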

Upvotes: 1

Views: 682

Answers (4)

Grayrigel

Reputation: 3594

You can build a dictionary of relative frequencies and use map:

d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x : x  if d[x] > 0.15 else 'Miscellanous' )

Output:

id
1           RUSSIA
2              USA
3           RUSSIA
4           RUSSIA
5     Miscellanous
6              USA
7              USA
8     Miscellanous
9              USA
10          RUSSIA
Name: Country, dtype: object

Upvotes: 1

rhug123

Reputation: 8768

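Here is another option, using replace with the list of rare values:

s = df['Country'].value_counts()
s = s/s.sum()                        # convert counts to fractions
rare = s.loc[s<.15].index.tolist()   # countries below the 15% cutoff
df['Country'] = df['Country'].replace(rare,'Miscellanous')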

Upvotes: 1

ansev

Reputation: 30920

Use:

df.loc[df.groupby('Country')['id']
         .transform('size')
         .div(len(df))
         .lt(0.15), 
       'Country'] = 'Miscellanous'

Or

df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
                                      .lt(0.15)), 
       'Country'] = 'Miscellanous'
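As a quick check, the sketch below runs both variants on the sample data from the question and confirms they select the same rows. It assumes `id` is a regular column (the `groupby` variant needs it); `mask_groupby` and `mask_counts` are illustrative names.

```python
import pandas as pd

# Sample data from the question, with `id` as an ordinary column (an assumption).
df = pd.DataFrame({
    "id": range(1, 11),
    "Country": ["RUSSIA", "USA", "RUSSIA", "RUSSIA", "INDIA",
                "USA", "USA", "ITALY", "USA", "RUSSIA"],
})

# Variant 1: group size per row, divided by the total row count.
mask_groupby = (df.groupby("Country")["id"]
                  .transform("size")
                  .div(len(df))
                  .lt(0.15))

# Variant 2: map each country to a flag saying whether its share is below 15%.
mask_counts = df["Country"].map(
    df["Country"].value_counts(normalize=True).lt(0.15))

# Both masks flag the same rows, so either can drive the assignment.
same_rows = mask_groupby.tolist() == mask_counts.tolist()

df.loc[mask_counts, "Country"] = "Miscellanous"
result = df["Country"].tolist()
```

Both masks are built with vectorized operations rather than a Python-level function call per row, which is likely to matter on a frame with millions of rows, though that has not been timed here.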

Upvotes: 5

Code Different

Reputation: 93161

If you want to put every country whose frequency is below a threshold into the "Misc" category:

threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()

df['Country'].map(mappings)
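Note that `map` returns a new Series, so the result has to be assigned back to change the dataframe. One way to package the same idea is sketched below; `lump_rare` is a hypothetical helper name, and it uses `Series.where` instead of the `mask`/`to_dict` mapping above, so treat it as a variant rather than the exact code from this answer.

```python
import pandas as pd

def lump_rare(series: pd.Series, threshold: float = 0.15,
              label: str = "Misc") -> pd.Series:
    """Return a copy of `series` with values rarer than `threshold` replaced by `label`.

    Hypothetical helper for illustration. Missing values are not counted by
    value_counts, so they fail the comparison and are also replaced by `label`.
    """
    freq = series.value_counts(normalize=True)
    # Keep values whose share is at least `threshold`; replace everything else.
    return series.where(series.map(freq) >= threshold, label)

# Sample data from the question.
df = pd.DataFrame({
    "id": range(1, 11),
    "Country": ["RUSSIA", "USA", "RUSSIA", "RUSSIA", "INDIA",
                "USA", "USA", "ITALY", "USA", "RUSSIA"],
})

# Assign back so the dataframe itself is updated.
df["Country"] = lump_rare(df["Country"], threshold=0.15, label="Misc")
```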

Upvotes: 1
