Join Pandas DataFrames matching by string and substring

Question

i want to merge two dataframes by partial string match. I have two data frames to combine. First df1 consists of 130.000 rows like this:

id    text                        xc1       xc2
1     adidas men shoes            52465     220
2     vakko men suits             49220     224
3     burberry men shirt          78248     289
4     prada women shoes           45780     789
5     lcwaikiki men sunglasses    34788     745

and second df2 consists of 8000 rows like this:

id    keyword               abc1     abc2
1     men shoes             1000     11
2     men suits             2000     12
3     men shirt             3000     13
4     women socks           4000     14
5     men sunglasses        5000     15

After matching between keyword and text, outputshould look like this:

id    text                        xc1       xc2   keyword         abc1  abc2
1     adidas men shoes            52465     220   men shoes       1000  11
2     vakko men suits             49220     224   men suits       2000  12
3     burberry men shirt          78248     289   men shirt       3000  13
4     lcwaikiki men sunglasses    34788     745   men sunglasses  5000  15

SeaBean · Accepted Answer

Let's approach by cross join the 2 dataframes and then filter by matching string with substring, as follows:

df3 = df1.merge(df2, how='cross')    # for Pandas version >= 1.2.0 (released in Dec 2020)

import re
mask = df3.apply(lambda x: (re.search(rf"\b{x['keyword']}\b", str(x['text']))) != None, axis=1)
df_out = df3.loc[mask]

If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:

# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)

After the cross join, we created a boolean mask to filter for the cases that keyword is found within text by using re.search within .apply().

We have to use re.search instead of simple Python substring test like stringA in stringB found in most of the similar answers in StackOverflow. Such kind of test will fail with false match of 'men suits' in keyword with 'women suits' in text since it returns True for test of 'men suits' in 'women suits'.

We use regex with a pair of word boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure matching only for whole word match for text in df1, i.e. men suits in df2 would not match with women suits in df1 since the word women does not have a word boundary between the letters wo and men.

Result:

print(df_out)


    id_x                      text    xc1  xc2  id_y         keyword  abc1  abc2
0      1          adidas men shoes  52465  220     1       men shoes  1000    11
6      2           vakko men suits  49220  224     2       men suits  2000    12
12     3        burberry men shirt  78248  289     3       men shirt  3000    13
24     5  lcwaikiki men sunglasses  34788  745     5  men sunglasses  5000    15

Here, columns id_x and id_y are the original id column in df1 and df2 respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset index to clean up the layout:

df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)

Final outcome

print(df_out)


                       text    xc1  xc2         keyword  abc1  abc2
0          adidas men shoes  52465  220       men shoes  1000    11
1           vakko men suits  49220  224       men suits  2000    12
2        burberry men shirt  78248  289       men shirt  3000    13
3  lcwaikiki men sunglasses  34788  745  men sunglasses  5000    15

Join Pandas DataFrames matching by string and substring

Answers (2)

Related Questions