Abdel Wahab Turkmani

Reputation: 109

How to replace uncommon elements in Pandas dataframe

I have a Pandas dataframe that looks like this:

                          Date  Time Spent (seconds)                     Activity
    0      2017-03-23T00:00:00                    92                  netflix.com
    1      2017-03-23T00:05:00                   158                  netflix.com
    2      2017-03-23T00:25:00                   285                  netflix.com
    3      2017-03-23T00:30:00                     5                  netflix.com
    4      2017-03-23T00:40:00                   214                  netflix.com
    5      2017-03-23T00:45:00                   300                  netflix.com
    6      2017-03-23T00:45:00                     5  Google Calendar for Android
    7      2017-03-23T00:45:00                     3                   Google Now
    8      2017-03-23T00:45:00                     1           LinkedIn - Android
    9      2017-03-23T00:50:00                    33                  netflix.com
    10     2017-03-23T01:10:00                   167                  netflix.com                          

Calling value_counts() on the Activity column gives the following:

    WhatsApp Messenger Android            1111
    netflix.com                            881
    mendeley desktop                       756
    sharelatex.com                         722
    Google Now                             647
    newtab                                 584
    google.co.uk                           501
    microsoft word                         449

I would like to replace every value in the Activity column of the original dataframe that occurs fewer than 20 times with the string 'other'.

I've tried iterating through the dataframe manually and replacing the values one by one, but my dataframe has several tens of thousands of rows, so that is very slow. What would be a better way to achieve this?

Upvotes: 2

Views: 105

Answers (3)

MaxU - stand with Ukraine

Reputation: 210872

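    vc = df['Activity'].value_counts()
    df.loc[df['Activity'].isin(vc.index[vc < 20]), 'Activity'] = 'other'

where vc is the result of value_counts, so vc.index[vc < 20] holds the activities that occur fewer than 20 times.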
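For anyone who wants to try this end to end, here is a minimal sketch of the same isin approach on a small made-up frame. The sample data and the names `threshold` and `rare` are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frame
df = pd.DataFrame({
    "Activity": ["netflix.com"] * 25
                + ["Google Now"] * 20
                + ["newtab"] * 3
                + ["LinkedIn - Android"],
})

threshold = 20  # cutoff taken from the question

# Count occurrences of each activity
vc = df["Activity"].value_counts()

# Labels that occur fewer than `threshold` times
rare = vc.index[vc < threshold]

# Overwrite those rows in place
df.loc[df["Activity"].isin(rare), "Activity"] = "other"

print(df["Activity"].value_counts())
```

The replacement happens in a single boolean-mask assignment, so no Python-level loop over the rows is needed.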

Upvotes: 1

Sergey Sergienko

Reputation: 365

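You could use pd.Series.map, which avoids writing an explicit loop over the rows:

    VC = df['Activity'].value_counts()
    # keep values that occur at least 20 times; replace the rest with 'other'
    df['Activity'] = df['Activity'].map(lambda p: p if VC[p] >= 20 else 'other')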
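The lambda above is still called once per row. A possible variant, sketched here under the assumption that a threshold of 20 is what is wanted, passes the value_counts Series itself to map and then uses Series.where, so the lookup and the comparison are both done by pandas. The sample data and the name `counts` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frame
df = pd.DataFrame({
    "Activity": ["netflix.com"] * 25
                + ["Google Now"] * 20
                + ["newtab"] * 3
                + ["LinkedIn - Android"],
})

# Map each row's activity to how often that activity occurs overall
counts = df["Activity"].map(df["Activity"].value_counts())

# Keep the value where the count is at least 20, otherwise use 'other'
df["Activity"] = df["Activity"].where(counts >= 20, "other")

print(df["Activity"].value_counts())
```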

Upvotes: 0

Scott Boston

Reputation: 153470

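Let's use groupby and transform. Note that assign returns a new dataframe rather than modifying df, so assign the result back to keep the change:

    df = df.assign(Activity=df.groupby('Activity')['Activity'].transform(lambda x: x if x.size >= 20 else 'other'))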
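As a rough alternative in the same spirit, the group sizes can be computed with the built-in "size" transform and combined with Series.mask, which avoids the lambda. This is only a sketch: the sample data and the names `sizes` and `result` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the real frame
df = pd.DataFrame({
    "Activity": ["netflix.com"] * 25
                + ["Google Now"] * 20
                + ["newtab"] * 3
                + ["LinkedIn - Android"],
})

# Size of each row's group, aligned to the original index
sizes = df.groupby("Activity")["Activity"].transform("size")

# mask replaces values where the condition is True; assign returns a new frame
result = df.assign(Activity=df["Activity"].mask(sizes < 20, "other"))

print(result["Activity"].value_counts())
```

Here `df` is left untouched and the relabelled column lives in `result`, which can be handy when the original labels are still needed for comparison.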

Upvotes: 4
