Reputation: 1454
I have written some code to map the ids of two dataframes and if a condition matches then create a count in a specified column in the existing dataframe, I am looking for a more efficient way of calculating it.
Sample Data
import numpy as np
import pandas as pd
d = {'ID' : pd.Series([111, 222, 111, 444, 222, 111]), 'Tag' : pd.Series([1, 2, 3, 1, 2, 1])}
df1 = (pd.DataFrame(d))
print(df1)
ID Tag
0 111 1
1 222 2
2 111 3
3 444 1
4 222 2
5 111 1
d = {'ID' : pd.Series([111, 444, 666, 444, 777])}
df2 = (pd.DataFrame(d))
print(df2)
ID
0 111
1 444
2 666
3 444
4 777
df2['tag1'] = 0
df2['tag2'] = 0
df2['tag3'] = 0
for index, row in df2.iterrows():
for i, t in df1.iterrows():
if row['ID'] == t['ID']:
if t['Tag'] == 1:
df2.loc[index]["tag1"] += 1
elif t['Tag'] == 2:
df2.loc[index]["tag2"] += 1
elif t['Tag'] == 3:
df2.loc[index]["tag3"] += 1
Output
print(df2)
ID tag1 tag2 tag3
0 111 2 0 1
1 444 1 0 0
2 666 0 0 0
3 444 1 0 0
4 777 0 0 0
What is the most efficient way of doing this, rather than computing iteratively?
Note, df1 can contain the sample ID
multiple times with a different value of Tag
(df1 and df2 are large dataframes, with 50,000 rows in df1 and 15,000 in df2)
Upvotes: 1
Views: 897
Reputation: 862511
You can use crosstab
with merge
:
print (pd.crosstab(df1.ID, df1.Tag))
Tag 1 2 3
ID
111 2 0 1
222 0 2 0
444 1 0 0
print (pd.merge(df2, pd.crosstab(df1.ID, df1.Tag)
.add_prefix('tag')
.reset_index(), on='ID', how='left')
.fillna(0)
.astype(int))
ID tag1 tag2 tag3
0 111 2 0 1
1 444 1 0 0
2 666 0 0 0
3 444 1 0 0
4 777 0 0 0
Instead crosstab
you can use groupby
with size
and unstack
:
print (df1.groupby(['ID', 'Tag'])['Tag'].size().unstack())
Tag 1 2 3
ID
111 2.0 NaN 1.0
222 NaN 2.0 NaN
444 1.0 NaN NaN
print (pd.merge(df2, df1.groupby(['ID', 'Tag'])['Tag'].size().unstack()
.add_prefix('tag')
.reset_index(), on='ID', how='left')
.fillna(0)
.astype(int))
ID tag1 tag2 tag3
0 111 2 0 1
1 444 1 0 0
2 666 0 0 0
3 444 1 0 0
4 777 0 0 0
Upvotes: 2