Steve
Steve

Reputation: 23

Pandas: top 10 string occurrences within group

I want to show top 5 most occurring queries per city, if there aren't 5 queries within a city, show them all.

Dummy data

data = {'city': ['Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
                 'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
                 'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
                 'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin'],
        'query': ['orange', 'pizza', 'pizza', 'pizza', 'apple', 'pizza', 'ricecracker', 'pizza', 'tomato', 'tomato',\
                  'orange', 'pizza', 'ricecracker', 'ricecracker', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'ricecracker', 'tomato',\
                  'taco', 'taco', 'pizza', 'pizza', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'tomato', 'tomato',\
                  'apple', 'taco', 'ricecracker', 'ricecracker', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'ricecracker', 'tomato']
       }

df = pd.DataFrame(data, columns = ['city', 'query'])

Desired output

city       query        count
Barcelona  pizza          10
           ricecracker    6
Berlin     tomato         6
           pizza          2
           ricecracker    2
           taco           2
Paris      pineapple      3
           apple          2
           orange         2   
           pizza          2  
           ricecracker    2

I have tried the following for just 1 city, but no idea how to translate it into a groupby:

df[df.city == 'Paris']['query'].value_counts().nlargest(5)

Upvotes: 2

Views: 163

Answers (2)

JP Maulion
JP Maulion

Reputation: 2504

Try this:

def get_top_n(df,n):
    to_be_merged = []
    for cities in set(df.city):
        df1 = df.loc[df.city==cities]
        z1 = df1.groupby('city')['query'].value_counts().head(n)
        to_be_merged.append(z1)
    return pd.concat(to_be_merged)

This yields your desired result (since you want top 5 queries per city): enter image description here

Upvotes: 1

Serge Ballesta
Serge Ballesta

Reputation: 149185

You can get the number or queries per city with a simple groupby:

df.assign(count=1).groupby(['city', 'query']).count()

which gives:

                       count
city      query             
Barcelona pizza            5
          ricecracker      3
Berlin    pizza            2
          ricecracker      1
          tomato           3
Paris     pineapple        2
          pizza            3
          ricecracker      1

If you want to keep only the five most frequent queries, you can filter the result with a second groupby:

df.assign(count=1).groupby(['city', 'query']).count().groupby(level=0).apply(lambda x:
             x.sort_values('count', ascending=False).head(5)).reset_index(level=0, drop=True)

which gives here the expected query order:

                       count
city      query             
Barcelona pizza            5
          ricecracker      3
Berlin    tomato           3
          pizza            2
          ricecracker      1
Paris     pizza            3
          pineapple        2
          ricecracker      1

Here is a demo limiting to 2 queries:

df.assign(count=1).groupby(['city', 'query']).count().groupby(level=0).apply(lambda x: 
            x.sort_values('count', ascending=False).head(2)).reset_index(level=0, drop=True)

                       count
city      query             
Barcelona pizza            5
          ricecracker      3
Berlin    tomato           3
          pizza            2
Paris     pizza            3
          pineapple        2

Upvotes: 1

Related Questions