Reputation: 435
Below is my dataframe
Txn_Key Send_Agent Send_Time Pay_Time Send_Amount \
0 NaN ANO080012 2012-05-31 02:25:00 2012-05-31 21:43:00 490.00
1 NaN AUK359401 2012-05-31 11:25:00 2012-05-31 11:57:00 616.16
2 NaN ACL000105 2012-05-31 13:07:00 2012-05-31 17:36:00 193.78
3 NaN AED420319 2012-05-31 10:50:00 2012-05-31 11:34:00 999.43
4 NaN ARA030210 2012-05-30 12:14:00 2012-05-31 04:16:00 433.29
5 NaN AJ5020114 2012-05-31 02:37:00 2012-05-31 04:31:00 378.00
6 NaN A11171047 2012-05-31 09:39:00 2012-05-31 10:08:00 865.34
Pay_Amount MTCN Send_Phone Refund_Flag time_diff
0 475.68 9323625903 97549829 NaN 0 days 19:18:00
1 600.87 3545067820 440000000000 NaN 0 days 00:32:00
2 185.21 1453132764 0511 NaN 0 days 04:29:00
3 963.04 4509062067 971566016900 NaN 0 days 00:44:00
4 423.75 6898279087 144 NaN 0 days 16:02:00
5 377.99 5170985243 963954932506 NaN 0 days 01:54:00
6 833.89 5352719100 0644798854 NaN 0 days 00:29:00
So , I need a count when Send_Amount is same for the next row. A groupby apply using a lambda works perfectly fine:
txn1 = txns.loc[:,['Send_Agent','Send_Amount']]
Send_repeat_count = txn1.groupby('Send_Agent').apply(lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()
....: )
But a similar lambda function did not work in groupby.agg .
grouped=txn.groupby('Send_Agent')
x=grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':'lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()'})
AttributeError: 'Series' object has no attribute 'Send_Amount'
So , I wrote a separate function to do the same and called it in my groupby.agg
def repeat_count(x):
if x==x.shift():
....: cumsum()
x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':repeat_count(x)})
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If cumsum can work fine with group by.apply why it does not work inside a function.
Upvotes: 1
Views: 2332
Reputation: 880459
In general, the Send_Agent
column is going to contain duplicates (otherwise, there would be no point in grouping by Send_Agent
). Moreover, (x==x.shift()).cumsum()
will return a Series with as many rows as there are duplicates in each Send_Agent
group.
df.groupby(...).agg(func)
requires that func
return a scalar (such as a float). func
is not allowed to return a Series. (In contrast, func
can return a Series
or even DataFrame
when df.groupby(...).apply(func)
is used.)
If you want to count the number of adjacent rows in a group which are equal you could use sum()
instead of cumsum()
. For example,
import numpy as np
import pandas as pd
pd.options.display.width = 1000
nan = np.nan
txn = pd.DataFrame(
{'MTCN': [0, 9323625903, 3545067820, 1453132764, 4509062067, 6898279087, 5170985243, 5352719100],
'Pay_Amount': [1, 475.68, 600.87, 185.21, 963.04, 423.75, 377.99, 833.89],
'Pay_Time': ['2012-05-31 10:08:00', '2012-05-31 21:43:00', '2012-05-31 11:57:00', '2012-05-31 17:36:00',
'2012-05-31 11:34:00', '2012-05-31 04:16:00', '2012-05-31 04:31:00',
'2012-05-31 10:08:00'],
'Refund_Flag': [nan, nan, nan, nan, nan, nan, nan, nan],
'Send_Amount': [865.34, 490.0, 616.16, 193.78, 999.43, 433.29, 378.0, 865.34],
'Send_Phone': [3, 97549829, 440000000000, 511, 971566016900, 144, 963954932506, 644798854],
'Send_Time': ['2012-05-31 09:39:00', '2012-05-31 02:25:00', '2012-05-31 11:25:00', '2012-05-31 13:07:00',
'2012-05-31 10:50:00', '2012-05-30 12:14:00', '2012-05-31 02:37:00',
'2012-05-31 09:39:00'],
'Txn_Key': [nan, nan, nan, nan, nan, nan, nan, nan],
'Send_Agent': ['A11171047', 'ANO080012', 'AUK359401', 'ACL000105', 'AED420319',
'ARA030210', 'AJ5020114', 'A11171047'],
'time_diff': ['0 days 00:29:00', '0 days 19:18:00', '0 days 00:32:00', '0 days 04:29:00',
'0 days 00:44:00', '0 days 16:02:00', '0 days 01:54:00',
'0 days 00:29:00', ]} )
txn['time_diff'] = pd.to_timedelta(txn['time_diff'])
grouped = txn.groupby('Send_Agent')
def repeat_count(s):
return (s.shift() == s).sum()
result = grouped.agg(
{'Pay_Amount':'mean',
'time_diff':'min',
'MTCN':'size',
'Send_Phone':'nunique',
'Refund_Flag':'count',
'Send_Amount': ['mean', repeat_count]})
print(result)
yields
Refund_Flag time_diff Send_Phone MTCN Send_Amount Pay_Amount
count min nunique size mean repeat_count mean
Send_Agent
A11171047 0 1740000000000 2 2 865.34 1.0 417.445
ACL000105 0 16140000000000 1 1 193.78 0.0 185.210
AED420319 0 2640000000000 1 1 999.43 0.0 963.040
AJ5020114 0 6840000000000 1 1 378.00 0.0 377.990
ANO080012 0 69480000000000 1 1 490.00 0.0 475.680
ARA030210 0 57720000000000 1 1 433.29 0.0 423.750
AUK359401 0 1920000000000 1 1 616.16 0.0 600.870
(I added an extra row so that repeat_count
does not always return 0.)
When you use DataFrame.groupby(...).apply(func)
, the objects passed to func
are DataFrames. Thus,
txn1.groupby('Send_Agent').apply(
lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum())
works because the txn1
inside the lambda
is a DataFrame with a Send_Amount
column.
In contrast, when you use DataFrame.groupby(...).agg({'col': func})
, the objects passed to func
are Series with values from the column specified by col
. Thus
x = grouped.agg({'Send_Amount':'mean','Pay_Amount':'mean','time_diff':'min','MTCN':'size','Send_Phone':'nunique','Refund_Flag':'count','Send_Amount':lambda txn1 : (txn1.Send_Amount.shift() == txn1.Send_Amount).cumsum()})
raised AttributeError: 'Series' object has no attribute 'Send_Amount'
because the Series passed to the lambda
function (and bound to the variable txn1
) has no Send_Amount
attribute.
If you use something like repeat_count
:
def repeat_count(x):
if x==x.shift():
return x.cumsum()
then if x==x.shift()
raises
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
because x==x.shift()
is a Series and if expression
causes expression
to be evaluated in a boolean context. That is, expression.__bool__()
gets called. __bool__
must return either True or False or raise an exception. Thus, for if x==x.shift()
to make sense, (x==x.shift()).__bool__()
would have to return either True or False.
Series.__bool__()
always raises the ValueError
above because Pandas (by design) does not guess if it should return True when all the values in the Series are True or when any of the values are True, or when the Series is merely nonempty, etc.... The ValueError
message points you in the right direction. Usually the problem is solved by being explicit about what boolean value you want by calling (x==x.shift()).any()
or (x==x.shift()).all()
, etc.
A note on performance: Generally, using groupby/agg
with a custom function does not perform as well as groupby/agg
with a built-in method like count
or sum
. So it usually pays to figure out a way (if possible) to express the calculation in terms of built-in methods. In this case, you could do a preparatory calculation on the whole DataFrame which would then allow you to use groupby/agg/sum
:
txn = txn.sort_values(by='Send_Agent')
txn['repeat'] = ((txn['Send_Agent'].shift() == txn['Send_Agent'])
& (txn['Send_Agent'].shift() == txn['Send_Agent']))
grouped = txn.groupby('Send_Agent')
result = grouped.agg(
{'Pay_Amount':'mean',
'time_diff':'min',
'MTCN':'size',
'Send_Phone':'nunique',
'Refund_Flag':'count',
'Send_Amount': 'mean',
'repeat':'sum'})
print(result)
Upvotes: 2