Reputation: 569
I have a DataFrame with thousands of rows and 20 columns. The date is the index and there are many same dates. Example df:
Stock Sales Data 1 Data 2
1/1/2012 Apple 120 0.996691907 0.376607328
1/1/2012 Apple 230 0.084699221 0.56433743
1/1/2012 Apple 340 0.141253424 0.319522467
1/1/2012 Berry 230 0.506264018 0.123657902
1/1/2012 Berry 340 0.646633737 0.635841995
1/1/2012 Cat 1250 0.204030887 0.928827628
1/1/2012 Cat 850 0.556935133 0.81033956
1/1/2012 Cat 650 0.771751177 0.988848472
1/1/2012 Cat 650 0.615222763 0.468555772
1/2/2012 Apple 1065 0.504410742 0.402553442
1/2/2012 Apple 200 0.752335341 0.487556857
1/2/2012 BlackBerry 1465 0.693017964 0.925737402
1/2/2012 BlackBerry 2000 0.262392424 0.076542936
1/2/2012 BlackBerry 1465 0.851841806 0.345077839
1/2/2012 BlackBerry 1465 0.70635569 0.718340524
1/2/2012 Tomato 700 0.911297224 0.155699549
1/2/2012 Tomato 235 0.118843588 0.662083069
1/2/2012 Carrot 500 0.07255267 0.585773563
I want to filter the data such that for each date and for each stock I only show a maximum of 3 rows, and I choose these 3 based on the ones which have the maximum Sales.
If there are instances when there are only 1 or 2 within each date and stock, then naturally it would maintain all the rows.
If there are 3 or more rows for a date and stock group then I only want 3 rows for the 3 max sales. If there is a joint 3rd position (With same sales numbers), I still only want MAXIMUM 3 rows for that date and stock, so either by random selection or any other suitable method, I would still spit out 3 rows for that stock for a particular date.
Example output could be something like this:
Stock Sales Data 1 Data 2
1/1/2012 Apple 120 0.996691907 0.376607328
1/1/2012 Apple 230 0.084699221 0.56433743
1/1/2012 Apple 340 0.141253424 0.319522467
1/1/2012 Berry 230 0.506264018 0.123657902
1/1/2012 Berry 340 0.646633737 0.635841995
1/1/2012 Cat 1250 0.204030887 0.928827628
1/1/2012 Cat 850 0.556935133 0.81033956
1/1/2012 Cat 650 0.771751177 0.988848472
1/2/2012 Apple 1065 0.504410742 0.402553442
1/2/2012 Apple 200 0.752335341 0.487556857
1/2/2012 BlackBerry 2000 0.262392424 0.076542936
1/2/2012 BlackBerry 1465 0.851841806 0.345077839
1/2/2012 BlackBerry 1465 0.70635569 0.718340524
1/2/2012 Tomato 700 0.911297224 0.155699549
1/2/2012 Tomato 235 0.118843588 0.662083069
1/2/2012 Carrot 500 0.07255267 0.585773563
Upvotes: 1
Views: 1584
Reputation: 29690
You can just use groupby
combined with nlargest
to accomplish this.
>>> data.groupby([data.index, data.Stock]).Sales.nlargest(3)
Stock
1/1/2012 Apple 1/1/2012 340
1/1/2012 230
1/1/2012 120
Berry 1/1/2012 340
1/1/2012 230
Cat 1/1/2012 1250
1/1/2012 850
1/1/2012 650
1/2/2012 Apple 1/2/2012 1065
1/2/2012 200
BlackBerry 1/2/2012 2000
1/2/2012 1465
1/2/2012 1465
Carrot 1/2/2012 500
Tomato 1/2/2012 700
1/2/2012 235
Name: Sales, dtype: int64
And of course if you wanted to output the full subset of your DataFrame as opposed to only the relevant info, we can use iloc
.
>>> data.iloc[data.reset_index().groupby(['index', 'Stock'])
.Sales.nlargest(3).index.levels[2]]
Stock Sales Data1 Data2
1/1/2012 Apple 120 0.996692 0.376607
1/1/2012 Apple 230 0.084699 0.564337
1/1/2012 Apple 340 0.141253 0.319522
1/1/2012 Berry 230 0.506264 0.123658
1/1/2012 Berry 340 0.646634 0.635842
1/1/2012 Cat 1250 0.204031 0.928828
1/1/2012 Cat 850 0.556935 0.810340
1/1/2012 Cat 650 0.771751 0.988848
1/2/2012 Apple 1065 0.504411 0.402553
1/2/2012 Apple 200 0.752335 0.487557
1/2/2012 BlackBerry 1465 0.693018 0.925737
1/2/2012 BlackBerry 2000 0.262392 0.076543
1/2/2012 BlackBerry 1465 0.851842 0.345078
1/2/2012 Tomato 700 0.911297 0.155700
1/2/2012 Tomato 235 0.118844 0.662083
1/2/2012 Carrot 500 0.072553 0.585774
Upvotes: 2
Reputation: 71
Using sort_values(), groupby(), and head() seems to yield the results you are looking for.
import pandas as pd
df = pd.read_table('fruit', sep='\s+')
df.Date = pd.to_datetime(df.Date)
df.sort_values(by=['Date', 'Stock', 'Sales'],
ascending=[True, True, False],
inplace=True)
# Date Stock Sales Data1 Data2
# 2 2012-01-01 Apple 340 0.141253 0.319522
# 1 2012-01-01 Apple 230 0.084699 0.564337
# 0 2012-01-01 Apple 120 0.996692 0.376607
# 4 2012-01-01 Berry 340 0.646634 0.635842
# 3 2012-01-01 Berry 230 0.506264 0.123658
# 5 2012-01-01 Cat 1250 0.204031 0.928828
# 6 2012-01-01 Cat 850 0.556935 0.810340
# 7 2012-01-01 Cat 650 0.771751 0.988848
# 8 2012-01-01 Cat 650 0.615223 0.468556
# 9 2012-01-02 Apple 1065 0.504411 0.402553
# 10 2012-01-02 Apple 200 0.752335 0.487557
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078
# 14 2012-01-02 BlackBerry 1465 0.706356 0.718341
# 17 2012-01-02 Carrot 500 0.072553 0.585774
# 15 2012-01-02 Tomato 700 0.911297 0.155700
# 16 2012-01-02 Tomato 235 0.118844 0.662083
df.groupby(by=['Date','Stock'], as_index=False, sort=False).head(3)
print df
# Date Stock Sales Data1 Data2
# 2 2012-01-01 Apple 340 0.141253 0.319522
# 1 2012-01-01 Apple 230 0.084699 0.564337
# 0 2012-01-01 Apple 120 0.996692 0.376607
# 4 2012-01-01 Berry 340 0.646634 0.635842
# 3 2012-01-01 Berry 230 0.506264 0.123658
# 5 2012-01-01 Cat 1250 0.204031 0.928828
# 6 2012-01-01 Cat 850 0.556935 0.810340
# 7 2012-01-01 Cat 650 0.771751 0.988848
# 9 2012-01-02 Apple 1065 0.504411 0.402553
# 10 2012-01-02 Apple 200 0.752335 0.487557
# 12 2012-01-02 BlackBerry 2000 0.262392 0.076543
# 11 2012-01-02 BlackBerry 1465 0.693018 0.925737
# 13 2012-01-02 BlackBerry 1465 0.851842 0.345078
# 17 2012-01-02 Carrot 500 0.072553 0.585774
# 15 2012-01-02 Tomato 700 0.911297 0.155700
# 16 2012-01-02 Tomato 235 0.118844 0.662083
Upvotes: 1