Woody Pride
Woody Pride

Reputation: 13965

Passing multiple columns as arguments to aggregation function groupby

I am still struggling to get really familiar with the pandas groupby operations. When passing a function to agg, what if the aggregation function that is passed needs to consider values in columns other than those that are being aggregated.

Consider the following data frame example which is lists sales of products by two salesmen:

DateList = np.array( [(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 2, 2, 3, 4, 4, 5]] + \
[(datetime.date.today() - datetime.timedelta(7)) + datetime.timedelta(days = x) for x in [1, 1, 2, 3, 4, 5, 5]]
Names = np.array(['Joe' for x in xrange(7)] + ['John' for x in xrange(7)])
Product = np.array(['Product1', 'Product1', 'Product2', 'Product2', 'Product2', 'Product3', 'Product3', \
                    'Product1', 'Product2', 'Product2', 'Product2', 'Product2', 'Product2', 'Product3'])
Volume = np.array([100, 0, 150, 175, 15, 120, 150, 75, 0, 115, 130, 135, 10, 120])
Prices = {'Product1' : 25.99, 'Product2': 13.99, 'Product3': 8.99}
SalesDF = DataFrame({'Date' : DateLists, 'Seller' : Names, 'Product' : Product, 'Volume' : Volume})
SalesDF.sort(['Date', 'Seller'], inplace = True)
SalesDF['Prices'] = SalesDF.Product.map(Prices)

On some days each seller sells more than one item. Suppose you wanted to aggregate the data set into a single day/seller observations, and you wished to do so based upon which product sold the most volume. To be clear, this would be simple for the volume measure, simply pass a max function to agg. However for evaluating which Product and Price would remain would mean determining which volume measure was highest and then returning the value tha corresponds with that max value.

I am able to get the result I want by using the index values in the column that is passed to the function when agg is called and referencing the underlying data frame:

def AggFunc(x, df, col1):
        #Create list of index values that index the data in the column passed as x
    IndexVals = list(x.index)

        #Use those index values to create a list of the values of col1 in those index positions in the underlying data frame. 
    ColList = list(df[col1][IndexVals])

        # Find the max value of the list of values of col1
    MaxVal = np.max(ColList)

        # Find the index value of the max value of the list of values of col1
    MaxValIndex = ColList.index(MaxVal)

        #Return the data point in the list of data passed as column x which correspond to index value of the the max value of the list of col1 data
    return list(x)[MaxValIndex]

FunctionDict = {'Product': lambda x : AggFunc(x, SalesDF, 'Volume'), 'Volume' : 'max',\
'Prices': lambda x : AggFunc(x, SalesDF, 'Volume')}

SalesDF.groupby(['Date', "Seller"], as_index = False).agg(FunctionDict)

But I'm wondering if there is a better way where I can pass 'Volume' as an argument to the function that aggregates Product without having to get the index values and create lists from the data in the underlying dataframe? Something tells me no, as agg passes each column as a series to the aggregation function, rather than the dataframe itself.

Any ideas?

Thanks

Upvotes: 0

Views: 848

Answers (1)

DSM
DSM

Reputation: 353419

Maybe extracting the right indices first using .idxmax would be simpler?

>>> grouped = SalesDF.groupby(["Date", "Seller"])["Volume"]
>>> max_idx = grouped.apply(pd.Series.idxmax)
>>> SalesDF.loc[max_idx]
          Date   Product Seller  Volume  Prices
0   2013-11-04  Product1    Joe     100   25.99
7   2013-11-04  Product1   John      75   25.99
2   2013-11-05  Product2    Joe     150   13.99
9   2013-11-05  Product2   John     115   13.99
3   2013-11-06  Product2    Joe     175   13.99
10  2013-11-06  Product2   John     130   13.99
5   2013-11-07  Product3    Joe     120    8.99
11  2013-11-07  Product2   John     135   13.99
6   2013-11-08  Product3    Joe     150    8.99
13  2013-11-08  Product3   John     120    8.99

idxmax gives the index of the first occurrence of the maximum value. If you want to keep multiple products if they all obtain the maximum volume, it'd be a little different, something more like

>>> max_vols = SalesDF.groupby(["Date", "Seller"])["Volume"].transform(max)
>>> SalesDF[SalesDF.Volume == max_vols]

Upvotes: 1

Related Questions