Pandas, map two dataframes, count based on condition

Question

I have written some code to map the ids of two dataframes and if a condition matches then create a count in a specified column in the existing dataframe, I am looking for a more efficient way of calculating it.

Sample Data

import numpy as np
import pandas as pd
d = {'ID' : pd.Series([111, 222, 111, 444, 222, 111]), 'Tag' : pd.Series([1, 2, 3, 1, 2, 1])}
df1 = (pd.DataFrame(d))
print(df1) 

    ID  Tag
0  111    1
1  222    2
2  111    3
3  444    1
4  222    2
5  111    1

d = {'ID' : pd.Series([111, 444, 666, 444, 777])}
df2 = (pd.DataFrame(d))
print(df2)    
    ID
0  111
1  444
2  666
3  444
4  777

df2['tag1'] = 0
df2['tag2'] = 0
df2['tag3'] = 0

for index, row in df2.iterrows():
    for i, t in df1.iterrows():
        if row['ID'] == t['ID']:
            if t['Tag'] == 1:
                df2.loc[index]["tag1"] += 1
            elif t['Tag'] == 2:
                df2.loc[index]["tag2"] += 1
            elif t['Tag'] == 3:
                df2.loc[index]["tag3"] += 1

Output

print(df2)
    ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0

What is the most efficient way of doing this, rather than computing iteratively?

Note, df1 can contain the sample ID multiple times with a different value of Tag

(df1 and df2 are large dataframes, with 50,000 rows in df1 and 15,000 in df2)

jezrael · Accepted Answer

You can use crosstab with merge:

print (pd.crosstab(df1.ID, df1.Tag))   
Tag  1  2  3
ID          
111  2  0  1
222  0  2  0
444  1  0  0

print (pd.merge(df2, pd.crosstab(df1.ID, df1.Tag)
                       .add_prefix('tag')
                       .reset_index(), on='ID', how='left')
         .fillna(0)
         .astype(int))  

    ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0

Instead crosstab you can use groupby with size and unstack:

print (df1.groupby(['ID', 'Tag'])['Tag'].size().unstack())   
Tag    1    2    3
ID                
111  2.0  NaN  1.0
222  NaN  2.0  NaN
444  1.0  NaN  NaN

print (pd.merge(df2, df1.groupby(['ID', 'Tag'])['Tag'].size().unstack()
                        .add_prefix('tag')
                        .reset_index(), on='ID', how='left')
         .fillna(0)
         .astype(int))  

    ID  tag1  tag2  tag3
0  111     2     0     1
1  444     1     0     0
2  666     0     0     0
3  444     1     0     0
4  777     0     0     0

Pandas, map two dataframes, count based on condition

Answers (1)

Related Questions