user18719682

Reputation: 21

Pandas: left join/right join on partial string match

I am trying to perform a left/right join on a partial string match in Python. There are many related questions on Stack Overflow, but I could not apply them to my case. I hope someone can point me in the right direction. I have the following two pandas DataFrames:

df1 = pd.DataFrame({'BName': ['10123', '10345', 'apple2', 'apple3', 'orange'],
                    'Description': ['aa', 'bb', 'cc', 'dd','ee']})
df2 = pd.DataFrame({'RefName': ['10345haha', 'bb10123', 'apple2hre', 'orangenono',"bye"],
                    'Type': ['01', '03', '04', '02','05'],
                    'Amount': ['250', '275', '260', '280','107'],
                    'Comment': ['bla', 'bla', 'bla', 'bla','bla']})

print(df1)
    BName Description
0   10123          aa
1   10345          bb
2  apple2          cc
3  apple3          dd
4  orange          ee

print(df2)
     RefName Type Amount Comment
0   10345haha   01    250     bla
1     bb10123   03    275     bla
2   apple2hre   04    260     bla
3  orangenono   02    280     bla
4         bye   05    107     bla

I want to keep every row of df2 and add the matching Description from df1. However, an ordinary left/right merge does not work here because the names are not exactly equal: each BName only appears as a substring of a RefName.

My desired output is like this:

      RefName Type Amount Comment Description
0   10345haha   01    250     bla          bb
1     bb10123   03    275     bla          aa
2   apple2hre   04    260     bla          cc
3  orangenono   02    280     bla          ee
4         bye   05    107     bla        None

Inspired by this post: https://stackoverflow.com/questions/50983398/pandas-join-on-partial-string-match-like-excel-vlookup

I tried to do something like this:

df4 = df2.copy()
df4['BName'] = [val for idx, val in enumerate(df1['BName']) if val in df2['RefName'][idx]]
df_m4 = df1.merge(df4, how='right', on='BName')

But this raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.9/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    384                 try:
--> 385                     return self._range.index(new_key)
    386                 except ValueError as err:

ValueError: 12957 is not in range

Is there another way to solve this problem? The solution does not have to fix the error above. Thanks in advance.

Upvotes: 2

Views: 966

Answers (1)

Shubham Sharma

Reputation: 71687

Create a regex pattern from the values in the BName column, then use str.extract to pull the first occurrence of that pattern out of each value in the RefName column. Store the result as a new BName column in df2, then left merge df2 with df1 to get the result:

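# Build one alternation pattern from the BName values, e.g. (10123|10345|...)
rpat = r'(%s)' % '|'.join(df1['BName'])
# Extract the first matching BName from each RefName (NaN when nothing matches)
df2['BName'] = df2['RefName'].str.extract(rpat, expand=False)

# Left merge on the shared BName column keeps every row of df2
df2.merge(df1, how='left')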

      RefName Type Amount Comment   BName Description
0   10345haha   01    250     bla   10345          bb
1     bb10123   03    275     bla   10123          aa
2   apple2hre   04    260     bla  apple2          cc
3  orangenono   02    280     bla  orange          ee
4         bye   05    107     bla     NaN         NaN
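
If the real BName values can contain regex metacharacters (such as '.' or '+'), or if one name is a prefix of another, the pattern may need a bit more care. Below is a minimal sketch of the same extract-then-merge idea using the sample data from the question. The re.escape call, the longest-first ordering, and the names keys, pattern, matched and result are my own additions for illustration and are not part of the approach above as written.

```python
import re

import pandas as pd

df1 = pd.DataFrame({'BName': ['10123', '10345', 'apple2', 'apple3', 'orange'],
                    'Description': ['aa', 'bb', 'cc', 'dd', 'ee']})
df2 = pd.DataFrame({'RefName': ['10345haha', 'bb10123', 'apple2hre', 'orangenono', 'bye'],
                    'Type': ['01', '03', '04', '02', '05'],
                    'Amount': ['250', '275', '260', '280', '107'],
                    'Comment': ['bla', 'bla', 'bla', 'bla', 'bla']})

# Escape each key so regex metacharacters are matched literally.
# Longer keys go first so that, when two keys could match at the same
# position, the longer one is tried before the shorter one.
keys = sorted(df1['BName'].astype(str).unique(), key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(k) for k in keys) + ')'

# expand=False returns a Series; rows without any match become NaN.
matched = df2['RefName'].str.extract(pattern, expand=False)

# Left merge keeps every row of df2; drop the helper column afterwards
# so the columns match the desired output in the question.
result = (df2.assign(BName=matched)
             .merge(df1, how='left', on='BName')
             .drop(columns='BName'))
print(result)
```

Note that only the first match per RefName is used, and that duplicate BName values in df1 would produce duplicate rows after the merge, so it may be worth deduplicating df1 first if that can happen in your data.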

Upvotes: 2
