Reputation: 23
I want to show top 5 most occurring queries per city, if there aren't 5 queries within a city, show them all.
Dummy data
data = {'city': ['Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin',\
'Paris','Berlin','Barcelona', 'Paris', 'Paris', 'Barcelona', 'Barcelona', 'Barcelona', 'Berlin', 'Berlin'],
'query': ['orange', 'pizza', 'pizza', 'pizza', 'apple', 'pizza', 'ricecracker', 'pizza', 'tomato', 'tomato',\
'orange', 'pizza', 'ricecracker', 'ricecracker', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'ricecracker', 'tomato',\
'taco', 'taco', 'pizza', 'pizza', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'tomato', 'tomato',\
'apple', 'taco', 'ricecracker', 'ricecracker', 'pineapple', 'pizza', 'ricecracker', 'pizza', 'ricecracker', 'tomato']
}
df = pd.DataFrame(data, columns = ['city', 'query'])
Desired output
city query count
Barcelona pizza 10
ricecracker 6
Berlin tomato 6
pizza 2
ricecracker 2
taco 2
Paris pineapple 3
apple 2
orange 2
pizza 2
ricecracker 2
I have tried the following for just 1 city, but no idea how to translate it into a groupby:
df[df.city == 'Paris']['query'].value_counts().nlargest(5)
Upvotes: 2
Views: 163
Reputation: 2504
Try this:
def get_top_n(df,n):
to_be_merged = []
for cities in set(df.city):
df1 = df.loc[df.city==cities]
z1 = df1.groupby('city')['query'].value_counts().head(n)
to_be_merged.append(z1)
return pd.concat(to_be_merged)
This yields your desired result (since you want top 5 queries per city):
Upvotes: 1
Reputation: 149185
You can get the number or queries per city with a simple groupby:
df.assign(count=1).groupby(['city', 'query']).count()
which gives:
count
city query
Barcelona pizza 5
ricecracker 3
Berlin pizza 2
ricecracker 1
tomato 3
Paris pineapple 2
pizza 3
ricecracker 1
If you want to keep only the five most frequent queries, you can filter the result with a second groupby:
df.assign(count=1).groupby(['city', 'query']).count().groupby(level=0).apply(lambda x:
x.sort_values('count', ascending=False).head(5)).reset_index(level=0, drop=True)
which gives here the expected query order:
count
city query
Barcelona pizza 5
ricecracker 3
Berlin tomato 3
pizza 2
ricecracker 1
Paris pizza 3
pineapple 2
ricecracker 1
Here is a demo limiting to 2 queries:
df.assign(count=1).groupby(['city', 'query']).count().groupby(level=0).apply(lambda x:
x.sort_values('count', ascending=False).head(2)).reset_index(level=0, drop=True)
count
city query
Barcelona pizza 5
ricecracker 3
Berlin tomato 3
pizza 2
Paris pizza 3
pineapple 2
Upvotes: 1