HelloToEarth
HelloToEarth

Reputation: 2127

Checking top values for dataframe columns in Python

I have a large dataset that looks like:

Shop              Date           Hour Ending         Hours Operating        Produced   
Cornerstop        01-01-2010          0                     1                   9
Cornerstop        01-01-2010          1                     1                   11
Cornerstop        01-01-2010          2                     1                   10
. 
.

Cornerstop        01-01-2010          23                    1                   0
Leaf Grove        01-01-2010          0                     1                   7
Leaf Grove        01-01-2010          1                     1                   4
Leaf Grove        01-01-2010          2                     1                   2

I want to find out which shops are the top 20 shops by how much they've produced. I've used data.describe() to check the top percentiles but this doesn't help me because if I threshold on the top percentile of 'Produced' some days are lost in the data.

This is a newbie question but how can I easily pick and target these top shops based on this criteria? Perhaps use the percentile just to create a range of the top shops and just cut those out in the dataset? Feels like there's a much better way to do this.

Upvotes: 0

Views: 670

Answers (3)

Taylrl
Taylrl

Reputation: 3919

What about the following to sort the column and then take the top 20?

df= df.sort_values(['Produced'], ascending=[False])
df.head(20)

Upvotes: 0

ALollz
ALollz

Reputation: 59549

Use .nlargest

df.groupby('Shop').Produced.sum().nlargest(20)

Add .index.tolist() if you just need a list of Shops.

Upvotes: 0

rahlf23
rahlf23

Reputation: 9019

Use sort_values() and head():

df.sort_values('Produced', ascending=False).head(20)

If you want to sum the production values for each shop and then sort, you can do:

df.groupby('Shop').agg({'Produced': 'sum'}).sort_values('Produced', ascending=False).head(20)

Upvotes: 2

Related Questions