mysterious_guy
mysterious_guy

Reputation: 435

python Pandas calling a complex function in groupby.agg

Below is my dataframe

    Txn_Key Send_Agent           Send_Time            Pay_Time  Send_Amount  \
0         NaN  ANO080012 2012-05-31 02:25:00 2012-05-31 21:43:00       490.00
1         NaN  AUK359401 2012-05-31 11:25:00 2012-05-31 11:57:00       616.16
2         NaN  ACL000105 2012-05-31 13:07:00 2012-05-31 17:36:00       193.78
3         NaN  AED420319 2012-05-31 10:50:00 2012-05-31 11:34:00       999.43
4         NaN  ARA030210 2012-05-30 12:14:00 2012-05-31 04:16:00       433.29
5         NaN  AJ5020114 2012-05-31 02:37:00 2012-05-31 04:31:00       378.00
6         NaN  A11171047 2012-05-31 09:39:00 2012-05-31 10:08:00       865.34
  Pay_Amount        MTCN      Send_Phone  Refund_Flag       time_diff
0         475.68  9323625903        97549829          NaN 0 days 19:18:00
1         600.87  3545067820    440000000000          NaN 0 days 00:32:00
2         185.21  1453132764            0511          NaN 0 days 04:29:00
3         963.04  4509062067    971566016900          NaN 0 days 00:44:00
4         423.75  6898279087             144          NaN 0 days 16:02:00
5         377.99  5170985243    963954932506          NaN 0 days 01:54:00
6         833.89  5352719100      0644798854          NaN 0 days 00:29:00

So , I need a count when Send_Amount is same for the next row. A groupby apply using a lambda works perfectly fine:

txn1 = txns.loc[:,['Send_Agent','Send_Amount']]
 Send_repeat_count =  txn1.groupby('Send_Agent').apply(lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()

....: )

But a similar lambda function did not work in groupby.agg .

grouped=txn.groupby('Send_Agent')

x=grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':'lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()'})

AttributeError: 'Series' object has no attribute 'Send_Amount'

So , I wrote a separate function to do the same and called it in my groupby.agg

 def repeat_count(x):
if x==x.shift():
 ....:         cumsum()


x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':repeat_count(x)})

     ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

If cumsum can work fine with group by.apply why it does not work inside a function.

Upvotes: 1

Views: 2332

Answers (1)

unutbu
unutbu

Reputation: 880459

In general, the Send_Agent column is going to contain duplicates (otherwise, there would be no point in grouping by Send_Agent). Moreover, (x==x.shift()).cumsum() will return a Series with as many rows as there are duplicates in each Send_Agent group.

df.groupby(...).agg(func) requires that func return a scalar (such as a float). func is not allowed to return a Series. (In contrast, func can return a Series or even DataFrame when df.groupby(...).apply(func) is used.)


If you want to count the number of adjacent rows in a group which are equal you could use sum() instead of cumsum(). For example,

import numpy as np
import pandas as pd
pd.options.display.width = 1000
nan = np.nan
txn = pd.DataFrame(
    {'MTCN': [0, 9323625903, 3545067820, 1453132764, 4509062067, 6898279087, 5170985243, 5352719100], 
     'Pay_Amount': [1, 475.68, 600.87, 185.21, 963.04, 423.75, 377.99, 833.89],
     'Pay_Time': ['2012-05-31 10:08:00', '2012-05-31 21:43:00', '2012-05-31 11:57:00', '2012-05-31 17:36:00', 
                  '2012-05-31 11:34:00', '2012-05-31 04:16:00', '2012-05-31 04:31:00', 
                  '2012-05-31 10:08:00'], 
     'Refund_Flag': [nan, nan, nan, nan, nan, nan, nan, nan], 
     'Send_Amount': [865.34, 490.0, 616.16, 193.78, 999.43, 433.29, 378.0, 865.34],
     'Send_Phone': [3, 97549829, 440000000000, 511, 971566016900, 144, 963954932506, 644798854],
     'Send_Time': ['2012-05-31 09:39:00', '2012-05-31 02:25:00', '2012-05-31 11:25:00', '2012-05-31 13:07:00', 
                   '2012-05-31 10:50:00', '2012-05-30 12:14:00', '2012-05-31 02:37:00', 
                   '2012-05-31 09:39:00'], 
     'Txn_Key': [nan, nan, nan, nan, nan, nan, nan, nan],
     'Send_Agent': ['A11171047', 'ANO080012', 'AUK359401', 'ACL000105', 'AED420319', 
                    'ARA030210', 'AJ5020114', 'A11171047'], 
     'time_diff': ['0 days 00:29:00', '0 days 19:18:00', '0 days 00:32:00', '0 days 04:29:00', 
                   '0 days 00:44:00', '0 days 16:02:00', '0 days 01:54:00', 
                   '0 days 00:29:00', ]} )
txn['time_diff'] = pd.to_timedelta(txn['time_diff']) 

grouped = txn.groupby('Send_Agent')

def repeat_count(s):
    return (s.shift() == s).sum()

result = grouped.agg(
    {'Pay_Amount':'mean',
     'time_diff':'min',
     'MTCN':'size',
     'Send_Phone':'nunique',
     'Refund_Flag':'count',
     'Send_Amount': ['mean', repeat_count]})
print(result)

yields

           Refund_Flag       time_diff Send_Phone MTCN Send_Amount              Pay_Amount
                 count             min    nunique size        mean repeat_count       mean
Send_Agent                                                                                
A11171047            0   1740000000000          2    2      865.34          1.0    417.445
ACL000105            0  16140000000000          1    1      193.78          0.0    185.210
AED420319            0   2640000000000          1    1      999.43          0.0    963.040
AJ5020114            0   6840000000000          1    1      378.00          0.0    377.990
ANO080012            0  69480000000000          1    1      490.00          0.0    475.680
ARA030210            0  57720000000000          1    1      433.29          0.0    423.750
AUK359401            0   1920000000000          1    1      616.16          0.0    600.870

(I added an extra row so that repeat_count does not always return 0.)


When you use DataFrame.groupby(...).apply(func), the objects passed to func are DataFrames. Thus,

txn1.groupby('Send_Agent').apply(
    lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum())

works because the txn1 inside the lambda is a DataFrame with a Send_Amount column.


In contrast, when you use DataFrame.groupby(...).agg({'col': func}), the objects passed to func are Series with values from the column specified by col. Thus

x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()})

raised AttributeError: 'Series' object has no attribute 'Send_Amount' because the Series passed to the lambda function (and bound to the variable txn1) has no Send_Amount attribute.


If you use something like repeat_count:

def repeat_count(x):
    if x==x.shift():
        return x.cumsum()

then if x==x.shift() raises

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

because x==x.shift() is a Series and if expression causes expression to be evaluated in a boolean context. That is, expression.__bool__() gets called. __bool__ must return either True or False or raise an exception. Thus, for if x==x.shift() to make sense, (x==x.shift()).__bool__() would have to return either True or False.

Series.__bool__() always raises the ValueError above because Pandas (by design) does not guess if it should return True when all the values in the Series are True or when any of the values are True, or when the Series is merely nonempty, etc.... The ValueError message points you in the right direction. Usually the problem is solved by being explicit about what boolean value you want by calling (x==x.shift()).any() or (x==x.shift()).all(), etc.


A note on performance: Generally, using groupby/agg with a custom function does not perform as well as groupby/agg with a built-in method like count or sum. So it usually pays to figure out a way (if possible) to express the calculation in terms of built-in methods. In this case, you could do a preparatory calculation on the whole DataFrame which would then allow you to use groupby/agg/sum:

txn = txn.sort_values(by='Send_Agent')
txn['repeat'] = ((txn['Send_Agent'].shift() == txn['Send_Agent']) 
                 & (txn['Send_Agent'].shift() == txn['Send_Agent']))

grouped = txn.groupby('Send_Agent')
result = grouped.agg(
    {'Pay_Amount':'mean',
     'time_diff':'min',
     'MTCN':'size',
     'Send_Phone':'nunique',
     'Refund_Flag':'count',
     'Send_Amount': 'mean',
     'repeat':'sum'})
print(result)

Upvotes: 2

Related Questions