MysterioProgrammer91
MysterioProgrammer91

Reputation: 569

Filtering pandas dataframe based on groupby (only top 3)

I have a DataFrame with thousands of rows and 20 columns. The date is the index and there are many same dates. Example df:

           Stock    Sales   Data 1        Data 2
1/1/2012    Apple   120     0.996691907    0.376607328
1/1/2012    Apple   230     0.084699221    0.56433743
1/1/2012    Apple   340     0.141253424    0.319522467
1/1/2012    Berry   230     0.506264018     0.123657902
1/1/2012    Berry   340     0.646633737     0.635841995
1/1/2012    Cat     1250    0.204030887    0.928827628
1/1/2012    Cat     850     0.556935133    0.81033956
1/1/2012    Cat     650     0.771751177    0.988848472
1/1/2012    Cat     650     0.615222763    0.468555772
1/2/2012    Apple   1065    0.504410742    0.402553442
1/2/2012    Apple   200     0.752335341    0.487556857
1/2/2012    BlackBerry  1465    0.693017964 0.925737402
1/2/2012    BlackBerry  2000    0.262392424 0.076542936
1/2/2012    BlackBerry  1465    0.851841806 0.345077839
1/2/2012    BlackBerry  1465    0.70635569  0.718340524
1/2/2012    Tomato  700     0.911297224    0.155699549
1/2/2012    Tomato  235     0.118843588    0.662083069
1/2/2012    Carrot  500 0.07255267  0.585773563

I want to filter the data such that for each date and for each stock I only show a maximum of 3 rows, and I choose these 3 based on the ones which have the maximum Sales.

If there are instances when there are only 1 or 2 within each date and stock, then naturally it would maintain all the rows.

If there are 3 or more rows for a date and stock group then I only want 3 rows for the 3 max sales. If there is a joint 3rd position (With same sales numbers), I still only want MAXIMUM 3 rows for that date and stock, so either by random selection or any other suitable method, I would still spit out 3 rows for that stock for a particular date.

Example output could be something like this:

       Stock    Sales   Data 1        Data 2
1/1/2012    Apple   120     0.996691907    0.376607328
1/1/2012    Apple   230     0.084699221    0.56433743
1/1/2012    Apple   340     0.141253424    0.319522467
1/1/2012    Berry   230     0.506264018     0.123657902
1/1/2012    Berry   340     0.646633737     0.635841995
1/1/2012    Cat     1250    0.204030887    0.928827628
1/1/2012    Cat     850     0.556935133    0.81033956
1/1/2012    Cat     650     0.771751177    0.988848472
1/2/2012    Apple   1065    0.504410742    0.402553442
1/2/2012    Apple   200     0.752335341    0.487556857
1/2/2012    BlackBerry  2000    0.262392424 0.076542936
1/2/2012    BlackBerry  1465    0.851841806 0.345077839
1/2/2012    BlackBerry  1465    0.70635569  0.718340524
1/2/2012    Tomato  700     0.911297224    0.155699549
1/2/2012    Tomato  235     0.118843588    0.662083069
1/2/2012    Carrot  500 0.07255267  0.585773563

Upvotes: 1

Views: 1584

Answers (2)

miradulo
miradulo

Reputation: 29690

You can just use groupby combined with nlargest to accomplish this.

>>> data.groupby([data.index, data.Stock]).Sales.nlargest(3)

          Stock               
1/1/2012  Apple       1/1/2012     340
                      1/1/2012     230
                      1/1/2012     120
          Berry       1/1/2012     340
                      1/1/2012     230
          Cat         1/1/2012    1250
                      1/1/2012     850
                      1/1/2012     650
1/2/2012  Apple       1/2/2012    1065
                      1/2/2012     200
          BlackBerry  1/2/2012    2000
                      1/2/2012    1465
                      1/2/2012    1465
          Carrot      1/2/2012     500
          Tomato      1/2/2012     700
                      1/2/2012     235
Name: Sales, dtype: int64

And of course if you wanted to output the full subset of your DataFrame as opposed to only the relevant info, we can use iloc.

>>> data.iloc[data.reset_index().groupby(['index', 'Stock'])
                                .Sales.nlargest(3).index.levels[2]]

               Stock  Sales     Data1     Data2
1/1/2012       Apple    120  0.996692  0.376607
1/1/2012       Apple    230  0.084699  0.564337
1/1/2012       Apple    340  0.141253  0.319522
1/1/2012       Berry    230  0.506264  0.123658
1/1/2012       Berry    340  0.646634  0.635842
1/1/2012         Cat   1250  0.204031  0.928828
1/1/2012         Cat    850  0.556935  0.810340
1/1/2012         Cat    650  0.771751  0.988848
1/2/2012       Apple   1065  0.504411  0.402553
1/2/2012       Apple    200  0.752335  0.487557
1/2/2012  BlackBerry   1465  0.693018  0.925737
1/2/2012  BlackBerry   2000  0.262392  0.076543
1/2/2012  BlackBerry   1465  0.851842  0.345078
1/2/2012      Tomato    700  0.911297  0.155700
1/2/2012      Tomato    235  0.118844  0.662083
1/2/2012      Carrot    500  0.072553  0.585774

Upvotes: 2

jspring
jspring

Reputation: 71

Using sort_values(), groupby(), and head() seems to yield the results you are looking for.

import pandas as pd

df = pd.read_table('fruit', sep='\s+')
df.Date = pd.to_datetime(df.Date)

df.sort_values(by=['Date', 'Stock', 'Sales'], 
               ascending=[True, True, False], 
               inplace=True)

#             Date       Stock  Sales     Data1     Data2
#    2  2012-01-01       Apple    340  0.141253  0.319522
#    1  2012-01-01       Apple    230  0.084699  0.564337
#    0  2012-01-01       Apple    120  0.996692  0.376607
#    4  2012-01-01       Berry    340  0.646634  0.635842
#    3  2012-01-01       Berry    230  0.506264  0.123658
#    5  2012-01-01         Cat   1250  0.204031  0.928828
#    6  2012-01-01         Cat    850  0.556935  0.810340
#    7  2012-01-01         Cat    650  0.771751  0.988848
#    8  2012-01-01         Cat    650  0.615223  0.468556
#    9  2012-01-02       Apple   1065  0.504411  0.402553
#    10 2012-01-02       Apple    200  0.752335  0.487557
#    12 2012-01-02  BlackBerry   2000  0.262392  0.076543
#    11 2012-01-02  BlackBerry   1465  0.693018  0.925737
#    13 2012-01-02  BlackBerry   1465  0.851842  0.345078
#    14 2012-01-02  BlackBerry   1465  0.706356  0.718341
#    17 2012-01-02      Carrot    500  0.072553  0.585774
#    15 2012-01-02      Tomato    700  0.911297  0.155700
#    16 2012-01-02      Tomato    235  0.118844  0.662083



df.groupby(by=['Date','Stock'], as_index=False, sort=False).head(3)

print df

#             Date       Stock  Sales     Data1     Data2
#    2  2012-01-01       Apple    340  0.141253  0.319522
#    1  2012-01-01       Apple    230  0.084699  0.564337
#    0  2012-01-01       Apple    120  0.996692  0.376607
#    4  2012-01-01       Berry    340  0.646634  0.635842
#    3  2012-01-01       Berry    230  0.506264  0.123658
#    5  2012-01-01         Cat   1250  0.204031  0.928828
#    6  2012-01-01         Cat    850  0.556935  0.810340
#    7  2012-01-01         Cat    650  0.771751  0.988848
#    9  2012-01-02       Apple   1065  0.504411  0.402553
#    10 2012-01-02       Apple    200  0.752335  0.487557
#    12 2012-01-02  BlackBerry   2000  0.262392  0.076543
#    11 2012-01-02  BlackBerry   1465  0.693018  0.925737
#    13 2012-01-02  BlackBerry   1465  0.851842  0.345078
#    17 2012-01-02      Carrot    500  0.072553  0.585774
#    15 2012-01-02      Tomato    700  0.911297  0.155700
#    16 2012-01-02      Tomato    235  0.118844  0.662083

Upvotes: 1

Related Questions