Reputation: 45
I have to find corresponding values in the two Pandas DataFrame.
Input: df1:
server system directions msgTYPE msgID count
0 1 sys1_in in ADT MSG0001 1
1 1 sys1_in in ADT MSG0002 1
2 1 sys1_in in ADT MSG0003 1
3 1 sys1_in in ADT MSG0004 1
df2:
server system directions msgTYPE msgID count
0 1 sys2_out out ADT MSG0001 1
1 1 sys2_out out ADT MSG0001 1
2 1 sys3_out out ADT MSG0003 1
3 1 sys4_out out ADT MSG0004 1
Output schould be:
system_in system_out count
0 sys1_in sys2_out 2
1 sys1_in sys3_out 1
2 sys1_in sys4_out 1
So I have to build from the two DF one df with the columns with in and out systems, there are with the mesgID correlated.
I am making it with the df.itertuples and df.goupby in the way:
model = pd.DataFrame(columns=['in', 'out', 'count'])
for item in ins.itertuples(index=True, name='Pandas'):
selected = outs.query('msgID == "%s"' % (getattr(item, "msgID")))
for row in selected.itertuples(index=True, name='Pandas2'):
model = model.append({'in': getattr(item, "system"), 'out': getattr(row, "system"), 'count': 1},
ignore_index=True)
result = model.groupby(['in', 'out'])['count'].sum().reset_index()
It works, but it is extream inefficient, the input Frames(df1, df2) have about 2 millions rows. Some one know more efficient way which is build in the Pandas framework?
Cheers.
Upvotes: 0
Views: 114
Reputation: 42916
You can achieve this by first merging
the dataframes by the corresponding columns and then using GroupBy
with named_aggregations
(new in pandas >= 0.25.0
):
columns = [col for col in df1.columns if col != 'system']
mrg = df1.merge(df2, on=columns, suffixes=['_in', '_out'])
mrg.groupby(columns).agg(
system_in=('system_in', 'first'),
system_out=('system_out', 'first'),
count=('system_in', 'size')
).reset_index(drop=True)
Output
system_in system_out count
0 sys1_in sys2_out 2
1 sys1_in sys3_out 1
2 sys1_in sys4_out 1
If you want to keep the columns as information, just use merge
and GroupBy.count
:
df1.merge(df2, on=columns, suffixes=['_in', '_out'])\
.groupby(columns, as_index=False).count()
Output
server directions msgTYPE msgID count system_in system_out
0 1 in ADT MSG0001 1 2 2
1 1 in ADT MSG0003 1 1 1
2 1 in ADT MSG0004 1 1 1
Upvotes: 1