Reputation: 35
I have data as below:
Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B
....
I need to filter out all tran that have 3 type A, AA, CC and A.Amount + AA.Amount= CC.Amount
The data is huge (100M records)
My code is as below but it run quite slow
df1=df.groupby("tran").filter(lambda x: len(x) == 3)
listrefn=df1.tran.tolist()
df1=df[df.tran.isin(listrefn)]
df2=df1[df1.field=='A']
dfA=df2[['tran','Amount']]
df2=df1[df1.field=='AA']
dfAA=df2[['tran','Amount']]
df2=df1[df1.field=='CC']
dfCC=df2[['tran','Amount']]
dfA=dfA.rename(columns={'tran':'tran','Amount':'A'})
dfAA=dfAA.rename(columns={'tran':'tran','Amount':'AA'})
dfCC=dfCC.rename(columns={'tran':'tran','Amount':'CC'})
dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
dftmp1['diff']=dftmp1.A-dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]
Please help to advice
Upvotes: 3
Views: 77
Reputation: 862671
#If necessary filtering:
#df = df[df.groupby("Tran")['Type'].transform('size') == 3]
idx = df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC').index
print (idx)
Int64Index([1212, 1213, 1214], dtype='int64', name='Tran')
df = df[df.Tran.isin(idx)]
#same as
#df = df.query('Tran in @idx')
print (df)
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
Another solution for filtering:
df = df.set_index('Tran').loc[idx].reset_index()
print (df)
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
Upvotes: 3
Reputation: 294278
Using set_index
. Nice thing is, A + AA == CC
won't happen unless all three are there so no need to check if all three are there.
df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')
Type A AA CC
Tran
1212 12.0 13.0 25.0
1213 78.0 1112.0 1190.0
1214 78.0 1112.0 1190.0
You can get the subset of original with
t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
Upvotes: 3
Reputation: 210842
UPDATE: looking at perfect @piRSquared's solution i've realized that we don't need to filter the source DF beforehand.
So this should be enough:
In [28]: df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[28]:
Type A AA CC
Tran
1212 12.0 13.0 25.0
1213 78.0 1112.0 1190.0
1214 78.0 1112.0 1190.0
OLD answer:
In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)
In [24]: x
Out[24]:
Tran Type Amount comment
0 1212 A 12 Buy
1 1212 AA 13 Buy
2 1212 CC 25 S
3 1213 AA 1112 B
4 1213 A 78 B
5 1213 CC 1190 SEllding
6 1214 AA 1112 B
7 1214 A 78 B
8 1214 CC 1190 SEllding
In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type A AA CC
Tran
1212 12 13 25
1213 78 1112 1190
1214 78 1112 1190
Upvotes: 2