Reputation: 109
I have a Pandas dataframe that looks like this:
                   Date  Time Spent (seconds)                     Activity
0   2017-03-23T00:00:00                    92                  netflix.com
1   2017-03-23T00:05:00                   158                  netflix.com
2   2017-03-23T00:25:00                   285                  netflix.com
3   2017-03-23T00:30:00                     5                  netflix.com
4   2017-03-23T00:40:00                   214                  netflix.com
5   2017-03-23T00:45:00                   300                  netflix.com
6   2017-03-23T00:45:00                     5  Google Calendar for Android
7   2017-03-23T00:45:00                     3                   Google Now
8   2017-03-23T00:45:00                     1           LinkedIn - Android
9   2017-03-23T00:50:00                    33                  netflix.com
10  2017-03-23T01:10:00                   167                  netflix.com
When I do value_counts on the Series Activity I get the following:
WhatsApp Messenger Android 1111
netflix.com 881
mendeley desktop 756
sharelatex.com 722
Google Now 647
newtab 584
google.co.uk 501
microsoft word 449
I would like to replace all items in the series Activity in the original dataframe that have a count/occurrence less than 20 with the string 'other'.
I've considered/tried doing this by iterating manually through the dataframe and replacing them, but my dataframe has several tens of thousands of rows and that is very inefficient. What would be a better way to achieve this?
Upvotes: 2
Views: 105
Reputation: 210872
df.loc[df.Activity.isin(vc.index[vc<20].values), 'Activity'] = 'other'
where vc is the result of value_counts, i.e. vc = df.Activity.value_counts()
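For anyone who wants to try this end to end, here is a minimal sketch of the same isin approach. The toy dataframe and the threshold of 2 are assumptions made only so the effect is visible on five rows; with the real data you would use 20.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "Activity": ["netflix.com", "netflix.com", "Google Now",
                 "netflix.com", "LinkedIn - Android"]
})
threshold = 2  # assumption for the demo; the question asks for 20

# Count how often each activity occurs.
vc = df["Activity"].value_counts()

# Labels whose count is below the threshold.
rare = vc.index[vc < threshold]

# Overwrite only the rows whose label is rare.
df.loc[df["Activity"].isin(rare), "Activity"] = "other"

print(df["Activity"].tolist())
```

On this toy data the two single-occurrence labels become 'other' and the three netflix.com rows are left alone.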
Upvotes: 1
Reputation: 365
You could use pd.Series.map, which is much faster than iterating over the rows manually:
VC = df['Activity'].value_counts()
df['Activity'] = df['Activity'].map(lambda p: p if VC[p] >= 20 else 'other')
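A possible variation, sketched under assumptions: map also accepts a Series, so the per-row count can be looked up without a Python lambda and then combined with Series.where. The toy data and the threshold of 2 are illustrative only, not from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "Activity": ["netflix.com", "netflix.com", "Google Now",
                 "netflix.com", "LinkedIn - Android"]
})
threshold = 2  # assumption for the demo; the question asks for 20

vc = df["Activity"].value_counts()

# Look up each row's label in vc to get its count, row by row.
counts = df["Activity"].map(vc)

# Keep the label where the count meets the threshold, else use 'other'.
df["Activity"] = df["Activity"].where(counts >= threshold, "other")

print(df["Activity"].tolist())
```

This keeps a label when its count is at least the threshold, which matches "replace everything with a count less than 20".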
Upvotes: 0
Reputation: 153470
Let's use groupby and transform. Note that assign returns a new dataframe, so assign the result back:
df = df.assign(Activity=df.groupby('Activity')['Activity'].transform(lambda x: x if x.size>=20 else 'other'))
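If the lambda turns out to be slow on many distinct activities, one alternative worth trying is transform('size'), which returns each row's group size directly. This is only a sketch: the toy data and the threshold of 2 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe.
df = pd.DataFrame({
    "Activity": ["netflix.com", "netflix.com", "Google Now",
                 "netflix.com", "LinkedIn - Android"]
})
threshold = 2  # assumption for the demo; the question asks for 20

# Size of the group each row belongs to, aligned with df's index.
sizes = df.groupby("Activity")["Activity"].transform("size")

# assign returns a new dataframe, so bind the result back to df.
df = df.assign(Activity=df["Activity"].where(sizes >= threshold, "other"))

print(df["Activity"].tolist())
```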
Upvotes: 4