Paweł Czech

Reputation: 45

Find corresponding values in two DataFrames

I have to find corresponding values in two Pandas DataFrames.

Input: df1:

   server    system     directions msgTYPE    msgID      count
0     1       sys1_in       in      ADT       MSG0001      1
1     1       sys1_in       in      ADT       MSG0002      1
2     1       sys1_in       in      ADT       MSG0003      1
3     1       sys1_in       in      ADT       MSG0004      1

df2:

   server    system     directions  msgTYPE     msgID      count
0     1       sys2_out       out      ADT       MSG0001      1
1     1       sys2_out       out      ADT       MSG0001      1
2     1       sys3_out       out      ADT       MSG0003      1
3     1       sys4_out       out      ADT       MSG0004      1

Output should be:

     system_in       system_out        count
0     sys1_in         sys2_out           2
1     sys1_in         sys3_out           1
2     sys1_in         sys4_out           1


So from the two DataFrames I have to build one DataFrame with columns for the in and out systems, which are correlated via msgID.

I am doing it with df.itertuples and df.groupby like this:

import pandas as pd

ins, outs = df1, df2  # df1 holds the "in" rows, df2 the "out" rows

model = pd.DataFrame(columns=['in', 'out', 'count'])
for item in ins.itertuples(index=True, name='Pandas'):
    # find all outgoing rows that carry the same msgID
    selected = outs.query('msgID == "%s"' % getattr(item, "msgID"))
    for row in selected.itertuples(index=True, name='Pandas2'):
        model = model.append({'in': getattr(item, "system"), 'out': getattr(row, "system"), 'count': 1},
                             ignore_index=True)
result = model.groupby(['in', 'out'])['count'].sum().reset_index()



It works, but it is extremely inefficient; the input frames (df1, df2) have about 2 million rows. Does anyone know a more efficient way that is built into the Pandas framework?

Cheers.

Upvotes: 0

Views: 114

Answers (1)

Erfan

Reputation: 42916

You can achieve this by first merging the dataframes on the columns they share (every column except system and directions, which differ between the two frames) and then using GroupBy with named aggregation (new in pandas >= 0.25.0):

# merge keys: all columns except 'system' and 'directions',
# which differ between the "in" and "out" frames
columns = [col for col in df1.columns if col not in ('system', 'directions')]

mrg = df1.merge(df2, on=columns, suffixes=['_in', '_out'])

mrg.groupby(columns).agg(
    system_in=('system_in', 'first'),
    system_out=('system_out', 'first'),
    count=('system_in', 'size')
).reset_index(drop=True)

Output

  system_in system_out  count
0   sys1_in   sys2_out      2
1   sys1_in   sys3_out      1
2   sys1_in   sys4_out      1
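
If you are on a pandas version older than 0.25, one possible equivalent without named aggregation is the following sketch, reusing mrg and columns from above:

# sketch for pandas < 0.25: plain groupby, no named aggregation
grp = mrg.groupby(columns)
res = grp[['system_in', 'system_out']].first()   # one in/out pair per group
res['count'] = grp.size()                        # number of matched rows per msgID
res = res.reset_index(drop=True)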

If you want to keep the columns as information, just use merge and GroupBy.count:

df1.merge(df2, on=columns, suffixes=['_in', '_out'])\
   .groupby(columns, as_index=False).count()

Output

   server msgTYPE    msgID  count  system_in  directions_in  system_out  directions_out
0       1     ADT  MSG0001      1          2              2           2               2
1       1     ADT  MSG0003      1          1              1           1               1
2       1     ADT  MSG0004      1          1              1           1               1

Upvotes: 1
