Reputation: 21
I am trying to perform a left/right join on a partial string match in Python. There are many related questions on StackOverflow, but I could not apply them to my case. I hope someone can point me in the right direction. I have the following two pandas dataframes:
import pandas as pd

df1 = pd.DataFrame({'BName': ['10123', '10345', 'apple2', 'apple3', 'orange'],
                    'Description': ['aa', 'bb', 'cc', 'dd', 'ee']})
df2 = pd.DataFrame({'RefName': ['10345haha', 'bb10123', 'apple2hre', 'orangenono', 'bye'],
                    'Type': ['01', '03', '04', '02', '05'],
                    'Amount': ['250', '275', '260', '280', '107'],
                    'Comment': ['bla', 'bla', 'bla', 'bla', 'bla']})
print(df1)
    BName  Description
0   10123           aa
1   10345           bb
2  apple2           cc
3  apple3           dd
4  orange           ee
print(df2)
      RefName  Type  Amount  Comment
0   10345haha    01     250      bla
1     bb10123    03     275      bla
2   apple2hre    04     260      bla
3  orangenono    02     280      bla
4         bye    05     107      bla
I want to keep every row of df2 and add the matching Description from df1. However, a plain left/right merge does not work here because the names are not exactly the same: each BName appears only as a substring of the corresponding RefName.
My desired output looks like this:
      RefName  Type  Amount  Comment  Description
0   10345haha    01     250      bla           bb
1     bb10123    03     275      bla           aa
2   apple2hre    04     260      bla           cc
3  orangenono    02     280      bla           ee
4         bye    05     107      bla         None
Inspired by this post (https://stackoverflow.com/questions/50983398/pandas-join-on-partial-string-match-like-excel-vlookup), I tried something like this:
df4 = df2.copy()
df4['BName'] = [val for idx, val in enumerate(df1['BName']) if val in df2['RefName'][idx]]
df_m4 = df1.merge(df4, how='right', on='BName')
But this raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
384 try:
--> 385 return self._range.index(new_key)
386 except ValueError as err:
ValueError: 12957 is not in range
I am not sure whether there are other ways to solve the problem (not necessarily by fixing the error above). Thanks in advance.
Upvotes: 2
Views: 966
Reputation: 71687
Create a regex pattern using the values from the BName column, then use str.extract to extract the occurrence of this pattern from the values in the RefName column, which creates a new column BName in df2. Then left merge df2 with df1 to get the result:
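# Build one alternation pattern from all BName values, e.g. (10123|10345|...)
rpat = r'(%s)' % '|'.join(df1['BName'])
# Pull the first matching BName out of each RefName (NaN when nothing matches)
df2['BName'] = df2['RefName'].str.extract(rpat)
# Left merge on the shared BName column keeps every row of df2
df2.merge(df1, how='left')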
      RefName  Type  Amount  Comment   BName  Description
0   10345haha    01     250      bla   10345           bb
1     bb10123    03     275      bla   10123           aa
2   apple2hre    04     260      bla  apple2           cc
3  orangenono    02     280      bla  orange           ee
4         bye    05     107      bla     NaN          NaN
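One caveat with a pattern joined from raw values: if a BName ever contains regex metacharacters (for example '.', '+' or '('), the pattern may match the wrong text or fail to compile. A minimal, more defensive sketch of the same approach is below. It is a variation, not part of the approach as originally posted. It assumes the keys should be matched literally, and it tries longer keys first so that a shorter key cannot win over a longer one starting at the same position. The names keys and result are only illustrative.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'BName': ['10123', '10345', 'apple2', 'apple3', 'orange'],
                    'Description': ['aa', 'bb', 'cc', 'dd', 'ee']})
df2 = pd.DataFrame({'RefName': ['10345haha', 'bb10123', 'apple2hre', 'orangenono', 'bye'],
                    'Type': ['01', '03', '04', '02', '05'],
                    'Amount': ['250', '275', '260', '280', '107'],
                    'Comment': ['bla', 'bla', 'bla', 'bla', 'bla']})

# Assumption: keys are literal substrings, so escape them before joining.
# Longer keys go first so they are preferred at the same start position.
keys = sorted(df1['BName'].astype(str), key=len, reverse=True)
rpat = '(' + '|'.join(re.escape(k) for k in keys) + ')'

# Work on a copy so df2 itself is left untouched.
result = df2.copy()
# expand=False returns a Series for a single capture group.
result['BName'] = result['RefName'].str.extract(rpat, expand=False)
# Left merge keeps all rows of df2; unmatched rows get NaN in Description.
result = result.merge(df1, how='left', on='BName')
print(result)
```

If the helper column is not wanted in the final frame, result.drop(columns='BName') removes it after the merge.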
Upvotes: 2