Reputation: 35

python pandas dataframe: need speed up process related to calculate 3 rows data

I have data as below:

Tran|Type|Amount|comment
1212|A|12|Buy
1212|AA|13|Buy
1212|CC|25|S
1213|AA|1112|B
1213|A|78|B
1213|CC|1190|SEllding
1214|AA|1112|B
1214|A|78|B
1214|CC|1190|SEllding
1215|AA|1112|B
1215|A|78|B
1216|AA|1112|B


....

I need to filter out all tran that have 3 type A, AA, CC and A.Amount + AA.Amount= CC.Amount

The data is huge (100M records)

My code is as below but it run quite slow

df1=df.groupby("tran").filter(lambda x: len(x) == 3)
listrefn=df1.tran.tolist()
df1=df[df.tran.isin(listrefn)]
df2=df1[df1.field=='A']
dfA=df2[['tran','Amount']]
df2=df1[df1.field=='AA']
dfAA=df2[['tran','Amount']]
df2=df1[df1.field=='CC']
dfCC=df2[['tran','Amount']]

dfA=dfA.rename(columns={'tran':'tran','Amount':'A'})
dfAA=dfAA.rename(columns={'tran':'tran','Amount':'AA'})
dfCC=dfCC.rename(columns={'tran':'tran','Amount':'CC'})

dftmp=pandas.merge(dfA,dfAA,how='left')
dftmp1=pandas.merge(dftmp,dfCC,how='left')
dftmp1['diff']=dftmp1.A-dftmp1.AA-dftmp1.CC
dftmp=dftmp1[['tran','diff']]
dftmp1=dftmp[dftmp['diff']==0]

Please help to advice

Upvotes: 3

Answers (3)

jezrael

Reputation: 862671

You can use pivot with query:

#If necessary filtering:
#df = df[df.groupby("Tran")['Type'].transform('size') == 3]

idx = df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC').index
print (idx)
Int64Index([1212, 1213, 1214], dtype='int64', name='Tran')

df = df[df.Tran.isin(idx)]
#same as
#df = df.query('Tran in @idx')
print (df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

Another solution for filtering:

df = df.set_index('Tran').loc[idx].reset_index()
print (df)
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

Upvotes: 3

piRSquared

Reputation: 294278

Using set_index. Nice thing is, A + AA == CC won't happen unless all three are there so no need to check if all three are there.

df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC')

Type     A      AA      CC
Tran                      
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0

You can get the subset of original with

t = df.set_index(['Tran', 'Type']).Amount.unstack().query('A + AA == CC').index
df.query("Tran in @t")
# equivalently
# df[df.Tran.isin(t)]

   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

Upvotes: 3

MaxU - stand with Ukraine

Reputation: 210842

UPDATE: looking at perfect @piRSquared's solution i've realized that we don't need to filter the source DF beforehand.

So this should be enough:

In [28]: df.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[28]:
Type     A      AA      CC
Tran
1212  12.0    13.0    25.0
1213  78.0  1112.0  1190.0
1214  78.0  1112.0  1190.0

OLD answer:

In [23]: x = df.groupby("Tran").filter(lambda x: len(x) == 3)

In [24]: x
Out[24]:
   Tran Type  Amount   comment
0  1212    A      12       Buy
1  1212   AA      13       Buy
2  1212   CC      25         S
3  1213   AA    1112         B
4  1213    A      78         B
5  1213   CC    1190  SEllding
6  1214   AA    1112         B
7  1214    A      78         B
8  1214   CC    1190  SEllding

In [25]: x.pivot(index='Tran', columns='Type', values='Amount').query('A + AA == CC')
Out[25]:
Type   A    AA    CC
Tran
1212  12    13    25
1213  78  1112  1190
1214  78  1112  1190

Upvotes: 2

python pandas dataframe: need speed up process related to calculate 3 rows data

Answers (3)

Related Questions