show top 5 rows for each group in dataframe with multiple numeric columns

I have the following dataframe which I obtained using: df.groupby(['departamento','campo']).describe()

df_statistics:

                            produccion                                         
                                mean           std          min           max
departamento campo                                                           
f7fd2c4f     8dd7c41b    4714.695603   1076.940951  3091.015553   6378.546534
             82edafb9    1851.291482    841.512944   675.814722   3006.476183
             58a0d8ca    1768.151315    347.896113  1033.459536   2242.544338
             8ba362f3     257.917212    231.490925     0.000000    497.916659
             4f4a249f     192.811711     80.299111   129.190598    356.437730
             741abe20     431.717352     71.053604   291.831556    529.518332
             51cbb05d     489.804186     65.542073   353.186216    582.869264
             4d0fb45e     358.597250     30.166391   314.168045    407.842103
             c98bd9dd     437.244383     27.135823   402.546159    481.245852
             7eb34927     106.426374     22.579237    81.994706    142.283652
ec12ad00     44502c89      15.015145     11.467353     0.000000     29.241879
             5558f26e       1.107400      0.959445     0.000000      2.762156
             85c1a0e5       0.122720      0.425113     0.000000      1.472635
cf33cb8a     2f614c0b   12458.858168  12042.715975   150.635367  25999.977584
             5559f8d7    4272.447078   1326.999765  2458.231739   6059.658900
             fd6f6562    3378.712031   1194.101786   869.763739   4814.220212
             febb6cf6    4149.936221    833.663173  2471.139924   5827.822674
             d56beadb     474.831361    810.840341     0.000000   2283.465569
             124207de    3863.484888    796.945367  2713.111304   5150.735620
             1fd2689f    6099.963902    768.102604  4766.241346   7897.993261
             c728bf96    3361.623457    704.293795  2203.721911   4949.989960

I have sorted the dataframe based on the standard deviation ('std') column, but I want to show only the top 5 values for each group in the column 'departamento'.

I tried the following code: df_statistics.nlargest(5, columns=('produccion', 'std'))

but that returns the top 5 rows overall, across all groups in the column 'departamento':

                            produccion                                         
                               mean           std          min           max
departamento campo                                                          
cf33cb8a     2f614c0b  12458.858168  12042.715975   150.635367  25999.977584
             5559f8d7   4272.447078   1326.999765  2458.231739   6059.658900
             fd6f6562   3378.712031   1194.101786   869.763739   4814.220212
f7fd2c4f     8dd7c41b   4714.695603   1076.940951  3091.015553   6378.546534
             82edafb9   1851.291482    841.512944   675.814722   3006.476183

How can I show the top 5 rows for each 'departamento' group, ranked by the 'std' column?

Upvotes: 2

Answers (2)

Scott Boston

Reputation: 153460

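If I understand correctly, the dataframe is already sorted by 'std' within each group, so taking the first 5 rows of each group is enough: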

df.groupby('departamento').head(5)

Output:

                         produccion                                         
                               mean           std          min           max
departamento campo                                                          
f7fd2c4f     8dd7c41b   4714.695603   1076.940951  3091.015553   6378.546534
             82edafb9   1851.291482    841.512944   675.814722   3006.476183
             58a0d8ca   1768.151315    347.896113  1033.459536   2242.544338
             8ba362f3    257.917212    231.490925     0.000000    497.916659
             4f4a249f    192.811711     80.299111   129.190598    356.437730
ec12ad00     44502c89     15.015145     11.467353     0.000000     29.241879
             5558f26e      1.107400      0.959445     0.000000      2.762156
             85c1a0e5      0.122720      0.425113     0.000000      1.472635
cf33cb8a     2f614c0b  12458.858168  12042.715975   150.635367  25999.977584
             5559f8d7   4272.447078   1326.999765  2458.231739   6059.658900
             fd6f6562   3378.712031   1194.101786   869.763739   4814.220212
             febb6cf6   4149.936221    833.663173  2471.139924   5827.822674
             d56beadb    474.831361    810.840341     0.000000   2283.465569

@recentadvances is correct:

df.sort_values(by=('produccion', 'std'), ascending=False)\
  .groupby('departamento')\
  .head(5)\
  .sort_index()

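Sort the dataframe by 'std' in descending order first, then group by 'departamento' and take the first 5 rows of each group with head. The final sort_index puts the rows back in index order, so each 'departamento' appears as one contiguous block.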
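
To make the sort-then-head pattern concrete, here is a minimal sketch on a made-up five-row frame. The names `stats` and `top_n` and all the values are illustrative assumptions, not the asker's data; the frame only mimics the two-level row index and two-level columns of df_statistics.

```python
import pandas as pd

# Hypothetical miniature of df_statistics (invented values).
index = pd.MultiIndex.from_tuples(
    [("A", "a1"), ("A", "a2"), ("A", "a3"), ("B", "b1"), ("B", "b2")],
    names=["departamento", "campo"],
)
columns = pd.MultiIndex.from_product([["produccion"], ["mean", "std"]])
stats = pd.DataFrame(
    [[10.0, 3.0], [20.0, 9.0], [30.0, 1.0], [40.0, 2.0], [50.0, 7.0]],
    index=index,
    columns=columns,
)

top_n = 2  # the question uses 5; 2 keeps the toy example small

# Sort globally by std (descending), then keep the first top_n rows per group.
top = (
    stats.sort_values(by=("produccion", "std"), ascending=False)
    .groupby(level="departamento")
    .head(top_n)
)

# Optional: regroup the rows by index label, as in the answer above.
top_by_index = top.sort_index()
print(top_by_index)
```

Note that `sort_index()` orders the rows by 'departamento' and then by 'campo', so the std ranking inside each group is no longer visible in the row order. Drop that last step if the ranking matters more than contiguous groups.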

Upvotes: 1

recentadvances

Reputation: 177

Use another groupby:

df_statistics.groupby('departamento')\
             .apply(lambda grp: grp.nlargest(5, columns=('produccion', 'std')))
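
A small sketch of this approach, again on invented data (`stats`, `top_n`, and the values are assumptions for illustration). Passing `group_keys=False` is an addition that is not in the answer: without it, `apply` typically prepends the group key to the result, so 'departamento' can end up appearing twice in the index.

```python
import pandas as pd

# Hypothetical miniature of df_statistics (invented values).
index = pd.MultiIndex.from_tuples(
    [("A", "a1"), ("A", "a2"), ("A", "a3"), ("B", "b1"), ("B", "b2")],
    names=["departamento", "campo"],
)
columns = pd.MultiIndex.from_product([["produccion"], ["mean", "std"]])
stats = pd.DataFrame(
    [[10.0, 3.0], [20.0, 9.0], [30.0, 1.0], [40.0, 2.0], [50.0, 7.0]],
    index=index,
    columns=columns,
)

top_n = 2  # the question uses 5

# Run nlargest separately inside each departamento group.
per_group = stats.groupby(level="departamento", group_keys=False).apply(
    lambda grp: grp.nlargest(top_n, columns=("produccion", "std"))
)
print(per_group)
```

Unlike the sort-then-head variant, this needs no prior sorting, at the cost of calling a Python function once per group.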

Upvotes: 1
