Reputation: 55
I am trying to perform a left merge using regular expressions in Python that can handle many-to-many relationships. Example:
df1 = pd.DataFrame(['a','b','c','d'], columns = ['col1'])
df1['regex'] = '.*' + df1['col1'] + '.*'
col1 regex
0 a .*a.*
1 b .*b.*
2 c .*c.*
3 d .*d.*
df2 = pd.DataFrame(['ab','a','cd'], columns = ['col2'])
col2
0 ab
1 a
2 cd
# Merge on regex column to col2
out = pd.DataFrame([['a','ab'],['a','a'],['b','ab'],['c','cd'],
['d','cd']],columns = ['col1','col2'])
col1 col2
0 a ab
1 a a
2 b ab
3 c cd
4 d cd
Upvotes: 3
Views: 1198
Reputation: 20659
You can use create a custom function to find all the matching indexes of both the data frames then extract those indexes and use pd.concat
.
import re
def merge_regex(df1, df2):
idx = [(i,j) for i,r in enumerate(df1.regex) for j,v in enumerate(df2.col2) if re.match(r,v)]
df1_idx, df2_idx = zip(*idx)
t = df1.iloc[list(df1_idx),0].reset_index(drop=True)
t1 = df2.iloc[list(df2_idx),0].reset_index(drop=True)
return pd.concat([t,t1],axis=1)
merge_regex(df1, df2)
col1 col2
0 a ab
1 a a
2 b ab
3 c cd
4 d cd
Timeit results
# My solution
In [292]: %timeit merge_regex(df1,df2)
1.21 ms ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Chris's solution
In [293]: %%timeit
...: df1['matches'] = df1.apply(lambda r: [x for x in df2['col2'].values if re.findall(r['regex'], x)], axis=1)
...:
...: df1.set_index('col1').explode('matches').reset_index().drop(columns=['regex'])
...:
...:
4.62 ms ± 25.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 2
Reputation: 16172
I'd look at each col1 value and find all of the values in col2 of df2 that the pattern matches and make a list and place it into a column called 'matches'. Once that's done you can set the col1 as index, explode the matches column to make them into one per row, then get rid of the regex pattern column.
import re
import pandas as pd
df1 = pd.DataFrame(['a','b','c','d'], columns = ['col1'])
df1['regex'] = '.*' + df1['col1'] + '.*'
df2 = pd.DataFrame(['ab','a','cd'], columns = ['col2'])
df1['matches'] = df1.apply(lambda r: [x for x in df2['col2'].values if re.findall(r['regex'], x)], axis=1)
df1.set_index('col1').explode('matches').reset_index().drop(columns=['regex'])
col1 matches
0 a ab
1 a a
2 b ab
3 c cd
4 d cd
Upvotes: 1