Naman Sood

Reputation: 47

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find the three most frequent restaurant names under each type of restaurant.


The columns are:

rest_type - Column for the type of restaurant

name - Column for the name of the restaurant

url - Column used for counting occurrences

This was the code that ended up working for me after some searching:

df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by="url", ascending=False).head(3))
             ['url'].reset_index().rename(columns={'url': 'count'}))

The final output was as follows:

[image: output of the code above]

I had a few questions pertaining to the above code:

How are we able to group by rest_type again for the datas variable after already grouping by it earlier? Shouldn't that raise a missing-column error? The second groupby operation is a bit confusing to me.

What does the first generated column, level_0, signify? When I tried the code with as_index=True, rest_type ended up as both an index level and a column, so I couldn't reset the index. Output below:

[image: output with as_index=True]

Thank you

Upvotes: 2

Views: 653

Answers (2)

Rahul Anand

Reputation: 81

I have a similar case where the query above only partially works: the cooccurrence value always comes out as 1. Here is my input dataframe: [image: input dataframe]

And my query is below

top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)

This is the result I am getting: [image: query result]

However, the cooccurrence value is always 1.

Upvotes: 0

mozway

Reputation: 260640

You can group by rest_type a second time because, after the first groupby, rest_type is a level of the index, and groupby accepts index level names as keys just as it accepts column names.
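
As a minimal sketch of this point (the toy data and the names counts and per_type_total are assumptions, not taken from the question), the first groupby moves rest_type and name into the index, and a second groupby can still use rest_type as a key:

```python
import pandas as pd

# Hypothetical toy data shaped like the question's dataframe
df = pd.DataFrame({
    "rest_type": ["Cafe", "Cafe", "Cafe", "Bar", "Bar"],
    "name": ["A", "A", "B", "C", "C"],
    "url": ["u1", "u2", "u3", "u4", "u5"],
})

# After the first groupby, the keys live in the index, not in the columns
counts = df.groupby(["rest_type", "name"]).agg("count")
print(list(counts.index.names))  # index levels: rest_type, name
print(list(counts.columns))      # only 'url' remains as a column

# A string key is looked up among index level names as well as columns,
# so grouping by 'rest_type' again does not raise a missing-column error
per_type_total = counts.groupby("rest_type")["url"].sum()
print(per_type_total)
```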

level_0 comes from the reset_index command: the outermost level of your index is unnamed, so reset_index gives it the default name level_0.
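
To illustrate, here is a small sketch with a hand-built index. The assumption is that it mimics the shape of the question's intermediate result, where the outermost level is an unnamed group number; the values are made up.

```python
import pandas as pd

# Hand-built MultiIndex whose first level has no name (assumed to mirror
# the index produced in the question; the values are hypothetical)
idx = pd.MultiIndex.from_tuples(
    [(0, "Bar", "C"), (1, "Cafe", "A"), (1, "Cafe", "B")],
    names=[None, "rest_type", "name"],
)
s = pd.Series([2, 2, 1], index=idx, name="url")

# reset_index turns every index level into a column; an unnamed level
# at position 0 receives the default name 'level_0'
flat = s.reset_index().rename(columns={"url": "count"})
print(list(flat.columns))

# Dropping the unnamed level first avoids the extra column
clean = s.reset_index(level=0, drop=True).reset_index()
print(list(clean.columns))
```

If the extra column is not wanted, dropping level 0 before the final reset_index, as in the last lines, is one way to get rid of it.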

That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:

import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20), # looks like this is a unique identifier
                  })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)

edit: here is an alternative that formats the result as a dataframe with informative column names:

(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
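
For comparison, here is a sketch of a variant that avoids apply altogether by combining size, sort_values and groupby(...).head. This is a different technique from the snippets above, not a rewrite of them. The data and the example.com URLs are hypothetical, and the counts are chosen without ties so that the result is deterministic.

```python
import pandas as pd

# Hypothetical data: each row is one listing (restaurant type, name)
rows = (
    [("Cafe", "A")] * 4 + [("Cafe", "B")] * 3 + [("Cafe", "C")] * 2
    + [("Cafe", "D")] + [("Bar", "E")] * 2 + [("Bar", "F")]
)
df = pd.DataFrame(rows, columns=["rest_type", "name"])
df["url"] = [f"https://example.com/{i}" for i in range(len(df))]

top3 = (
    df.groupby(["rest_type", "name"]).size()   # occurrences per (type, name)
      .reset_index(name="count")               # back to flat, named columns
      .sort_values(["rest_type", "count"], ascending=[True, False])
      .groupby("rest_type").head(3)            # first 3 rows of each type
      .reset_index(drop=True)
)
print(top3)
```

Because the counting step already yields named columns, no rename of level_0 or level_1 is needed afterwards.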

Upvotes: 2
