Reputation: 2127
I have a large dataset that looks like:
Shop Date Hour Ending Hours Operating Produced
Cornerstop 01-01-2010 0 1 9
Cornerstop 01-01-2010 1 1 11
Cornerstop 01-01-2010 2 1 10
.
.
Cornerstop 01-01-2010 23 1 0
Leaf Grove 01-01-2010 0 1 7
Leaf Grove 01-01-2010 1 1 4
Leaf Grove 01-01-2010 2 1 2
I want to find out which shops are the top 20 shops by how much they've produced. I've used data.describe()
to check the top percentiles but this doesn't help me because if I threshold on the top percentile of 'Produced' some days are lost in the data.
This is a newbie question but how can I easily pick and target these top shops based on this criteria? Perhaps use the percentile just to create a range of the top shops and just cut those out in the dataset? Feels like there's a much better way to do this.
Upvotes: 0
Views: 670
Reputation: 3919
What about the following to sort the column and then take the top 20?
df= df.sort_values(['Produced'], ascending=[False])
df.head(20)
Upvotes: 0
Reputation: 59549
Use .nlargest
df.groupby('Shop').Produced.sum().nlargest(20)
Add .index.tolist()
if you just need a list of Shops.
Upvotes: 0
Reputation: 9019
Use sort_values()
and head()
:
df.sort_values('Produced', ascending=False).head(20)
If you want to sum the production values for each shop and then sort, you can do:
df.groupby('Shop').agg({'Produced': 'sum'}).sort_values('Produced', ascending=False).head(20)
Upvotes: 2