Reputation: 13965
I am still struggling to get really familiar with the pandas groupby
operations. When passing a function to agg
, what if the aggregation function that is passed needs to consider values in columns other than those that are being aggregated.
Consider the following data frame example which is lists sales of products by two salesmen:
DateList = np.array( [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 2, 2, 3, 4, 4, 5]] + \
[(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 1, 2, 3, 4, 5, 5]]
Names = np.array(['Joe' for x in xrange(7)] + ['John' for x in xrange(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3', \
'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1' : 25.99, 'Product2': 13.99, 'Product3': 8.99}
SalesDF = DataFrame({'Date' : DateLists, 'Seller' : Names, 'Product' : Product, 'Volume' : Volume})
SalesDF.sort(['Date', 'Seller'], inplace = True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)
On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single day/seller observations, and you wished to do so based upon which product sold the most volume. To be clear, this would be simple for the volume measure, simply pass a max function to agg. However for evaluating which Product and Price would remain would mean determining which volume measure was highest and then returning the value tha corresponds with that max value.
I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:
def AggFunc(x, df, col1):
#Create list of index values that index the data in the column passed as x
IndexVals = list(x.index)
#Use those index values to create a list of the values of col1 in those index positions in the underlying data frame.
ColList = list(df[col1][IndexVals])
# Find the max value of the list of values of col1
MaxVal = np.max(ColList)
# Find the index value of the max value of the list of values of col1
MaxValIndex = ColList.index(MaxVal)
#Return the data point in the list of data passed as column x which correspond to index value of the the max value of the list of col1 data
return list(x)[MaxValIndex]
FunctionDict = {'Product': lambda x : AggFunc(x, SalesDF, 'Volume'), 'Volume' : 'max',\
'Prices': lambda x : AggFunc(x, SalesDF, 'Volume')}
SalesDF.groupby(['Date', "Seller"], as_index = False).agg(FunctionDict)
But I'm wondering if there is a better way where I can pass 'Volume' as an argument to the function that aggregates Product without having to get the index values and create lists from the data in the underlying dataframe? Something tells me no, as agg
passes each column as a series to the aggregation function, rather than the dataframe itself.
Any ideas?
Thanks
Upvotes: 0
Views: 848
Reputation: 353419
Maybe extracting the right indices first using .idxmax
would be simpler?
>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
Date Product Seller Volume Prices
0 2013-11-04 Product1 Joe 100 25.99
7 2013-11-04 Product1 John 75 25.99
2 2013-11-05 Product2 Joe 150 13.99
9 2013-11-05 Product2 John 115 13.99
3 2013-11-06 Product2 Joe 175 13.99
10 2013-11-06 Product2 John 130 13.99
5 2013-11-07 Product3 Joe 120 8.99
11 2013-11-07 Product2 John 135 13.99
6 2013-11-08 Product3 Joe 150 8.99
13 2013-11-08 Product3 John 120 8.99
idxmax
gives the index of the first occurrence of the maximum value. If you want to keep multiple products if they all obtain the maximum volume, it'd be a little different, something more like
>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform(max)
>>> SalesDF[SalesDF.Volume == max_vols]
Upvotes: 1