Reputation: 81
I am wondering if there is a fast way to merge two pandas tables by regular expression in Python.
For example: table A
col1 col2
1 apple_3dollars_5
2 apple_2dollar_4
1 orange_5dollar_3
1 apple_1dollar_3
table B
col1 col2
good (apple|oragne)_\dollars_5
bad .*_1dollar_.*
ok oragne_\ddollar_\d
Output:
col1 col2 col3
1 apple_3dollars_5 good
1 orange_5dollar_3 ok
1 apple_1dollar_3 bad
This is just an example; instead of merging on one column that matches exactly, I want to join using a regular expression. Thank you!
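Here is a minimal way to set the example up as pandas DataFrames (I'll call them A and B, matching the answers below; the patterns in B are copied exactly as written above, typos and all):
import pandas as pd

A = pd.DataFrame({'col1': [1, 2, 1, 1],
                  'col2': ['apple_3dollars_5', 'apple_2dollar_4',
                           'orange_5dollar_3', 'apple_1dollar_3']})

B = pd.DataFrame({'col1': ['good', 'bad', 'ok'],
                  'col2': [r'(apple|oragne)_\dollars_5',
                           r'.*_1dollar_.*',
                           r'oragne_\ddollar_\d']})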
Upvotes: 7
Views: 9187
Reputation: 2469
Similar to @MaxU, I use .replace, but I replace the column of values that you want to merge on with the regex strings that they match. A small warning: this can cause issues such as a non-unique index if your normal text matches more than one regex pattern. So, using your DataFrame A and @MaxU's fixed regexes for DataFrame B:
A['joinCol'] = A.col2.replace(B.col2.tolist(),
                              [p.replace('\\', '\\\\') for p in B.col2],  # double each backslash so e.g. \d survives literally in the replacement
                              regex=True)
B = B.rename(columns={'col2': 'joinCol'})  # the join columns should have the same name
C = A.merge(B, on='joinCol', how='left')  # plain left merge now that the keys match exactly
If you want, you can then drop that join column:
C = C.drop('joinCol', axis=1)
Upvotes: 0
Reputation: 86
I took the idea from https://python.tutorialink.com/can-i-perform-a-left-join-merge-between-two-dataframes-using-regular-expressions-with-pandas/ and improved it a little, so the original data can have more than one column and we can now do a real left join (merge) with regex!
import pandas as pd
d = {'extra_colum1': ['x', 'y', 'z', 'w'],'field': ['ab', 'a', 'cd', 'e'], 'extra_colum2': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(data=d)
df_dict = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['destination'])
df_dict['field'] = '.*' + df_dict['destination'] + '.*'
df_dict.columns=['destination','field']
def merge_regex(df, df_dict, how, field):
    import re

    df_dict = df_dict.drop_duplicates()

    # pair every regex in df_dict[field] with every value in df[field] that it matches
    idx = [(i, j)
           for i, r in enumerate(df_dict[field])
           for j, v in enumerate(df[field])
           if re.match(r, v)]
    df_dict_idx, df_idx = zip(*idx)

    # build a translation table: destination label next to the raw field value it matched
    t = df_dict.iloc[list(df_dict_idx), 0].reset_index(drop=True)
    t1 = df.iloc[list(df_idx), df.columns.get_loc(field)].reset_index(drop=True)
    df_dict_translated = pd.concat([t, t1], axis=1)

    # now an ordinary (exact) merge on the field column works
    data = pd.merge(
        df,
        df_dict_translated,
        how=how,
        left_on=field,
        right_on=field
    )
    data = data.drop_duplicates()
    return data
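A usage example with the DataFrames defined above (the how='left' and field='field' arguments are just the ones from this snippet):
result = merge_regex(df, df_dict, how='left', field='field')
print(result)  # rows of df with the matching 'destination' values attached; no match gives NaN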
Upvotes: 3
Reputation: 210922
First of all, fix the RegEx'es in the B DataFrame:
In [222]: B
Out[222]:
col1 col2
0 good (apple|oragne)_\ddollars_5
1 bad .*_1dollar_.*
2 ok orange_\ddollar_\d
Now we can prepare the following variables:
In [223]: to_repl = B.col2.values.tolist()
In [224]: vals = B.col1.values.tolist()
In [225]: to_repl
Out[225]: ['(apple|oragne)_\\ddollars_5', '.*_1dollar_.*', 'orange_\\ddollar_\\d']
In [226]: vals
Out[226]: ['good', 'bad', 'ok']
Finally we can use them in the replace function:
In [227]: A['col3'] = A['col2'].replace(to_repl, vals, regex=True)
In [228]: A
Out[228]:
col1 col2 col3
0 1 apple_3dollars_5 good
1 2 apple_2dollar_4 apple_2dollar_4
2 1 orange_5dollar_3 ok
3 1 apple_1dollar_3 bad
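Note that a value which matches none of the patterns (row 1 above) is simply left unchanged in col3. If you would rather flag such rows as missing, one possible follow-up (just a sketch on top of the replace trick, assuming a recent pandas for pd.NA) is:
# rows where no pattern matched still have col3 equal to the original col2
A.loc[A['col3'] == A['col2'], 'col3'] = pd.NA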
Upvotes: 8