Yieh Yan

Reputation: 39

Merge dataframes based on substrings

I want to merge/join two large dataframes, where the 'matching' column of the dataframe on the right is assumed to contain substrings of the left dataframe's 'id' column.

For illustration purposes:

import pandas as pd
import numpy as np

df1=pd.DataFrame({'id':['abc','adcfek','acefeasdq'],'numbers':[1,2,np.nan],'add_info':[3123,np.nan,312441]})
df2=pd.DataFrame({'matching':['adc','fek','acefeasdq','abcef','acce','dcf'],'needed_info':[1,2,3,4,5,6],'other_info':[22,33,11,44,55,66]})

This is df1:

          id  numbers  add_info
0        abc      1.0    3123.0
1     adcfek      2.0       NaN
2  acefeasdq      NaN  312441.0

And this is df2:

    matching  needed_info  other_info
0        adc            1          22
1        fek            2          33
2  acefeasdq            3          11
3      abcef            4          44
4       acce            5          55
5        dcf            6          66

And this is the desired output:

          id  numbers  add_info  needed_info  other_info
0        abc      1.0    3123.0          NaN         NaN
1     adcfek      2.0       NaN          2.0        33.0
2     adcfek      2.0       NaN          6.0        66.0
3  acefeasdq      NaN  312441.0          3.0        11.0

So, as described, I only want to merge the additional columns when the 'matching' column is a substring of the 'id' column. If it is the other way around, e.g. the 'id' 'abc' is a substring of the 'matching' 'abcef', nothing should happen.

In my data, a lot of the matches between df1 and df2 are actually exact, like the 'acefeasdq' row. But there are cases where one 'id' contains multiple 'matching' values. For the moment, it is okay-ish to ignore these cases, but I'd like to learn how I can tackle this issue. And additionally, is it possible to mark out which rows were merged based on substrings and which rows were merged exactly?

Upvotes: 0

Views: 65

Answers (2)

codeape

Reputation: 100816

You can use pd.merge(how='cross') to create a dataframe containing all combinations of the rows, and then filter that dataframe using a boolean series:

df = pd.merge(df1, df2, how="cross")
include_row = df.apply(lambda row: row.matching in row.id, axis=1)
filtered = df.loc[include_row]
print(filtered)
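One caveat: the filtered frame drops df1 rows that have no substring match at all (like 'abc'), whereas the desired output keeps them with NaNs. A minimal sketch of a possible follow-up, left-merging the filtered rows back onto df1 and flagging exact matches (the 'is_exact' column name is just an illustration, not part of the question's data):

# Sketch: flag exact vs. substring matches, then re-attach df1 rows without any match.
filtered = filtered.copy()
filtered["is_exact"] = filtered["matching"] == filtered["id"]  # True for rows like 'acefeasdq'

# Left-merge onto df1 so ids with no match (e.g. 'abc') survive as NaN rows.
result = df1.merge(
    filtered[["id", "needed_info", "other_info", "is_exact"]],
    on="id",
    how="left",
)
print(result)

Rows merged on substrings come out with is_exact False; rows with no match at all end up with NaN in all three added columns.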


Upvotes: 1

bracko

Reputation: 372

If your processing can handle a CROSS JOIN (problematic with large datasets), you could cross join and then filter down to only the rows you want:

cross = df1.merge(df2, how='cross')  # all row combinations
mask = cross.apply(lambda x: str(x['matching']) in str(x['id']), axis=1)  # True where 'matching' is a substring of 'id'
final = cross[mask]  # keep only the rows where the condition was met
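If the cross join is too large to materialize, one possible alternative (a sketch, assuming df2 is the smaller of the two frames) is to loop over the 'matching' values and select the df1 rows whose 'id' contains each one:

# Sketch: avoid the full Cartesian product by scanning df1 once per 'matching' value.
pieces = []
for _, d2_row in df2.iterrows():
    hits = df1[df1['id'].str.contains(d2_row['matching'], regex=False)]
    pieces.append(hits.assign(needed_info=d2_row['needed_info'],
                              other_info=d2_row['other_info']))
matched = pd.concat(pieces, ignore_index=True)
print(matched)

Note that, unlike the desired output, this drops df1 rows without any match and orders the result by df2's rows rather than df1's; it also still scans df1 once per df2 row, so it only helps when df2 is comparatively small.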

Upvotes: 1
