Remove entries based on group by

Question

I have a dataset which looks like this:

venue_id,latitude,longitude,venue_category,country_code,user_id,uct_time,time_offset
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,4337,Tue Apr 03 20:35:48 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,101773,Tue Apr 03 20:46:53 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,105093,Tue Apr 03 22:39:56 +0000 2012,420
4af833a6f964a5205a0b22e3,13.693775,100.751152,Airport,TH,58835,Tue Apr 03 22:54:52 +0000 2012,420
....

and I need to remove the rows whose venue_id occurs fewer than 100 times.

I have tried to use the following code:

joined = joined[joined.groupby("venue_id").venue_id.transform(len) >= 100]

which is inspired by the answer to the question with ID 13446480.

The problem is that it gives me the following error:

AttributeError: 'DataFrameGroupBy' object has no attribute 'venue_id'

Please bear in mind that I am new to pandas and I want to learn, so if you could give some explanation as well I would be grateful.

Cheers,

Dan

jezrael · Accepted Answer

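It seems the first column is the index, so reset_index should help. A groupby object only exposes columns as attributes, so venue_id cannot be reached that way while it is the index.

So you need: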

joined = joined.reset_index()
joined = joined[joined.groupby("venue_id")['venue_id'].transform(len) >= 100]
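
As a small illustration of the above, here is a sketch on toy data. The frame, the variable names (df, indexed, flat, kept) and the threshold of 3 are made up for the example and are not the asker's dataset. It should reproduce the AttributeError when venue_id is the index, and show the filter working after reset_index:

```python
import pandas as pd

# Hypothetical toy data: venue "a" appears 3 times, venue "b" once.
df = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b"],
        "user_id": [1, 2, 3, 4],
    }
)

# Mimic the situation in the question: venue_id is the index, not a column.
indexed = df.set_index("venue_id")

try:
    # Attribute access only finds columns, so this fails for an index level.
    indexed.groupby("venue_id").venue_id
    raised = False
except AttributeError:
    raised = True

# After reset_index, venue_id is an ordinary column and can be selected.
flat = indexed.reset_index()
counts = flat.groupby("venue_id")["venue_id"].transform(len)

# Threshold of 3 for the toy data; the question uses 100.
kept = flat[counts >= 3]
print(raised)
print(kept)
```

transform returns one value per original row (the size of that row's group), which is why the result can be used directly as a boolean mask on the frame.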

This also works for me when the first column is the index, without needing reset_index:

joined = joined[joined.groupby("venue_id").transform(len) >= 100]

If you are not using a recent version of pandas (0.20.1), you need to group by the index level and select some column:

joined = joined[joined.groupby(level="venue_id")['latitude'].transform(len) >= 100]

EDIT1:

Using 'size' instead of len is faster:

joined = joined[joined.groupby("venue_id")['latitude'].transform('size') >= 100]
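
For reference, a minimal self-contained sketch of the 'size' variant. The data, the min_count name and the threshold of 2 are assumptions chosen so the toy example has something to drop:

```python
import pandas as pd

# Hypothetical toy data: "a" occurs 3 times, "b" once, "c" twice.
df = pd.DataFrame(
    {
        "venue_id": ["a", "a", "a", "b", "c", "c"],
        "latitude": [13.69, 13.69, 13.69, 40.71, 51.50, 51.50],
    }
)

min_count = 2  # stands in for the 100 used in the question

# For every row, the number of rows sharing its venue_id.
group_sizes = df.groupby("venue_id")["latitude"].transform("size")

# Keep only rows whose venue occurs at least min_count times.
filtered = df[group_sizes >= min_count]
print(filtered)
```

Note that 'size' counts every row in the group, including rows where the selected column is NaN, whereas 'count' would skip them.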
