Reputation: 4608
i want to merge two dataframes by partial string match. I have two data frames to combine. First df1 consists of 130.000 rows like this:
id text xc1 xc2
1 adidas men shoes 52465 220
2 vakko men suits 49220 224
3 burberry men shirt 78248 289
4 prada women shoes 45780 789
5 lcwaikiki men sunglasses 34788 745
and second df2 consists of 8000 rows like this:
id keyword abc1 abc2
1 men shoes 1000 11
2 men suits 2000 12
3 men shirt 3000 13
4 women socks 4000 14
5 men sunglasses 5000 15
After matching between keyword and text, outputshould look like this:
id text xc1 xc2 keyword abc1 abc2
1 adidas men shoes 52465 220 men shoes 1000 11
2 vakko men suits 49220 224 men suits 2000 12
3 burberry men shirt 78248 289 men shirt 3000 13
4 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Upvotes: 1
Views: 1738
Reputation: 23217
Let's approach by cross join the 2 dataframes and then filter by matching string with substring, as follows:
df3 = df1.merge(df2, how='cross') # for Pandas version >= 1.2.0 (released in Dec 2020)
import re
mask = df3.apply(lambda x: (re.search(rf"\b{x['keyword']}\b", str(x['text']))) != None, axis=1)
df_out = df3.loc[mask]
If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross'
, you can replace the merge statement with:
# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
After the cross join, we created a boolean mask to filter for the cases that keyword
is found within text
by using re.search
within .apply()
.
We have to use re.search
instead of simple Python substring test like stringA in stringB
found in most of the similar answers in StackOverflow. Such kind of test will fail with false match of 'men suits'
in keyword
with 'women suits'
in text
since it returns True
for test of 'men suits' in 'women suits'
.
We use regex with a pair of word boundary \b
meta-characters around the keyword
(regex pattern: rf"\b{x['keyword']}\b"
) to ensure matching only for whole word match for text
in df1
, i.e. men suits
in df2
would not match with women suits
in df1
since the word women
does not have a word boundary between the letters wo
and men
.
Result:
print(df_out)
id_x text xc1 xc2 id_y keyword abc1 abc2
0 1 adidas men shoes 52465 220 1 men shoes 1000 11
6 2 vakko men suits 49220 224 2 men suits 2000 12
12 3 burberry men shirt 78248 289 3 men shirt 3000 13
24 5 lcwaikiki men sunglasses 34788 745 5 men sunglasses 5000 15
Here, columns id_x
and id_y
are the original id
column in df1
and df2
respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset index to clean up the layout:
df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)
Final outcome
print(df_out)
text xc1 xc2 keyword abc1 abc2
0 adidas men shoes 52465 220 men shoes 1000 11
1 vakko men suits 49220 224 men suits 2000 12
2 burberry men shirt 78248 289 men shirt 3000 13
3 lcwaikiki men sunglasses 34788 745 men sunglasses 5000 15
Upvotes: 3
Reputation: 2293
Let's start by ordering the keywords longest-first, so that "women suits" matches "before "men suits"
lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)
Now define a matching function; each text
value from df1
will be passed as s
to find a matching keyword:
def is_match(arr, s):
for a in arr:
if a in s:
return a
return None
Now we can extract the keyword from each text
in df1, and add it to a new column:
df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))
We now have everything we need for a standard merge:
pd.merge(df1, df2, on='keyword')
Upvotes: 2