Yieh Yan

Reputation: 39

Merge dataframes based on substrings

I want to merge/join two large dataframes, where the 'matching' column of the dataframe on the right is assumed to contain substrings of the left dataframe's 'id' column.

For illustration purposes:

import pandas as pd
import numpy as np

df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})

This is df1:

          id  numbers  add_info
0        abc      1.0    3123.0
1     adcfek      2.0       NaN
2  acefeasdq      NaN  312441.0

And this is df2:

    matching  needed_info  other_info
0        adc            1          22
1        fek            2          33
2  acefeasdq            3          11
3      abcef            4          44
4       acce            5          55
5        dcf            6          66

And this is the desired output:

          id  numbers  add_info  needed_info  other_info
0        abc      1.0    3123.0          NaN         NaN
1     adcfek      2.0       NaN          2.0        33.0
2     adcfek      2.0       NaN          6.0        66.0
3  acefeasdq      NaN  312441.0          3.0        11.0

So, as described, I only want to merge the additional columns when the 'matching' column is a substring of the 'id' column. If it is the other way around, e.g. the 'id' 'abc' is a substring of the 'matching' 'abcef', nothing should happen.

In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where one 'id' contains multiple 'matching' values. For the moment, it is okay-ish to ignore these cases, but I'd like to learn how I can tackle this issue. And additionally, is it possible to mark out which rows were merged based on substrings and which rows were merged exactly?

Upvotes: 0

Views: 65

Answers (2)

codeape

Reputation: 100816

You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows, and then filter that dataframe using a boolean series:

df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
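One caveat: the filtered frame drops df1 rows that have no substring match at all (like 'abc'), whereas the desired output keeps them with NaNs. A minimal sketch of a possible follow-up, left-merging the filtered rows back onto df1 and flagging exact matches (the 'is_exact' column name is just an illustration, not part of the question's data):

# Sketch: flag exact vs. substring matches, then re-attach df1 rows without any match.
filtered = filtered.copy()
filtered["is_exact"] = filtered["matching"] == filtered["id"]  # True for rows like 'acefeasdq'

# Left-merge onto df1 so ids with no match (e.g. 'abc') survive as NaN rows.
result = df1.merge(
    filtered[["id", "needed_info", "other_info", "is_exact"]],
    on="id",
    how="left",
)
print(result)

Rows merged on substrings come out with is_exact False; rows with no match at all end up with NaN in all three added columns.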


Upvotes: 1

bracko

Reputation: 372

If your processing can handle a CROSS JOIN (problematic with large datasets), you could cross join and then filter down to only the rows you want:

cross = df1.merge(df2, how='cross')  # all row combinations
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)  # True where 'matching' is a substring of 'id'
final = cross[mask]  # keep only the rows where the condition was met
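If the cross join is too large to materialize, one possible alternative (a sketch, assuming df2 is the smaller of the two frames) is to loop over the 'matching' values and select the df1 rows whose 'id' contains each one:

# Sketch: avoid the full Cartesian product by scanning df1 once per 'matching' value.
pieces = []
for _, d2_row in df2.iterrows():
    hits = df1[df1['id'].str.contains(d2_row['matching'], regex=False)]
    pieces.append(hits.assign(needed_info=d2_row['needed_info'],
                              other_info=d2_row['other_info']))
matched = pd.concat(pieces, ignore_index=True)
print(matched)

Note that, unlike the desired output, this drops df1 rows without any match and orders the result by df2's rows rather than df1's; it also still scans df1 once per df2 row, so it only helps when df2 is comparatively small.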

Upvotes: 1
