Vaibhav
Vaibhav

Reputation: 617

What is the efficient way to perform row wise match in pandas?

Assume 2 data frames (df_a and df_b). I want to traverse row-wise and check for an exact match in the value column. If a match is found, I want the index of the matched row to be added in df_a.

df_a

| Index | Name | Value |
|-------|------|-------|
| 1     | Bon  | 124   |
| 2     | Bon  | 412   |
| 3     | Jaz  | 634   |
| 4     | Cal  | 977   |
| 5     | Cal  | 412   |
| 6     | Bon  | 412   |

df_b

| Index | Name | Value |
|-------|------|-------|
| 1     | Cal  | 977   |
| 2     | Jaz  | 634   |
| 3     | Lan  | 650   |
| 4     | Bon  | 412   |

Expected Output df

| Index | Name | Value | Index_in_df_b |
|-------|------|-------|---------------|
| 1     | Bon  | 124   | Unmatched     |
| 2     | Bon  | 412   | 4             |
| 3     | Jaz  | 634   | 2             |
| 4     | Cal  | 977   | 1             |
| 5     | Cal  | 412   | Unmatched     |
| 6     | Bon  | 412   | Unmatched     |

Existing Solution:

Create a column --> df_a['Index_in_df_b'] = 'Unmatched'

Then I had 3 solutions:

Started using iterrows. This solution took a lot of time to process so we shifted to using .loc. This solution took about 20 minutes to process the data frames with over 7 columns and around 15000 rows in each of them. Then we started using .at. This seems to be by far the best way to process it. It took ~3 minutes to process the same data frame as mentioned above. This is the current solution.

for index_a in df_a.index:
    for index_b in df_b.index:
        if df_a.at[index_a,'Name'] == df_b.at[index_b, 'Name']:
            # Processing logic to check for value

I'm not sure if apply can be used since 2 data frames and their row-wise details are necessary and also not sure about vectorization methods. Is there a faster way to proceed with this problem or is the current solution apt for this?

Upvotes: 0

Views: 575

Answers (1)

jezrael
jezrael

Reputation: 862591

Use if need match Name and Value columns use DataFrame.merge with left join and convert index to column Index_in_df_b:

df2 = df_b.rename_axis('Index_in_df_b').reset_index()
df = df_a.merge(df2, on=['Name','Value'], how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
  Name  Value Index_in_df_b
0  Bon    124     Unmatched
1  Bon    412             4
2  Jaz    634             2
3  Cal    977             1
4  Cal    412     Unmatched

If need match only by Value column output is different in sample data:

df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Value']]
df = df_a.merge(df2, on='Value', how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
  Name  Value Index_in_df_b
0  Bon    124     Unmatched
1  Bon    412             4
2  Jaz    634             2
3  Cal    977             1
4  Cal    412             4

If need match only by Name column output is different in sample data:

df2 = df_b.rename_axis('Index_in_df_b').reset_index()[['Index_in_df_b','Name']]
df = df_a.merge(df2, on='Name', how='left').fillna({'Index_in_df_b':'Unmatched'})
print (df)
  Name  Value  Index_in_df_b
0  Bon    124              4
1  Bon    412              4
2  Jaz    634              2
3  Cal    977              1
4  Cal    412              1

Upvotes: 2

Related Questions