Reputation: 10203
I have been struggling with this problem all week. I have two DataFrames as follows:
df1:
Account| ID | Name
--------------------------------------
B36363 | 2019001 | John
G47281 | 2019002;2018101 | Alice;Emma
H46291 | 2019001 | John
df2:
Account | Col_B | Col_C
-----------------------------
B36363-0 | text_b1 | text_c1
01_G47281 | text_b2 | text_c2
X_H46291 | text_b3 | text_c3
II_G47281 | text_b4 | text_C4
I want to merge these DataFrames on Account when df2.Account contains df1.Account (not the full match with a normal merge/join!)
Desired output:
df3:
Account | Col_B | Col_C | ID | Name
--------------------------------------------------------------
B36363-0 | text_b1 | text_c1 | 2019001 | John
01_G47281 | text_b2 | text_c3 | 2019002;2018101 | Alice;Emma
X_H46291 | text_b3 | text_c3 | 2019001 | John
II_G47281 | text_b4 | text_C4 | 2019002;2018101 | Alice;Emma
I have no sample code because I have no idea how to handle this. A normal merge/join goes well, but not if I want to use contains. Thank you very much in advance
Upvotes: 2
Views: 122
Reputation: 75100
You can try str.extract
with join()
:
d=df1.set_index('Account').agg(list,axis=1).to_dict()
p='({})'.format('|'.join(df1.Account))
#'(B36363|G47281|H46291)'
m=pd.DataFrame(df2.Account.str.extract(p,expand=False).map(d).fillna('').tolist()
,columns=['ID','Name'],index=df2.index)
df2.join(m)
Account Col_B Col_C ID Name
1 B36363-0 text_b1 text_c1 2019001 John
2 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
4 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
Upvotes: 3
Reputation: 25259
Try str.extract
on df2.Account
and set the result to index of df2
and join
pat1 = '('+'|'.join(df1.Account)+')'
s = df2.Account.str.extract(pat1, expand=False)
df2.set_index(s).join(df1.set_index('Account')).reset_index(drop=True)
Out[644]:
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
Another way is using merge
df2.assign(Account2=df2.Account.str.extract(pat1, expand=False)) \
.merge(df1, left_on='Account2', right_on='Account', suffixes=('', 'y')) \
.drop(['Account2', 'Accounty'], 1)
Out[645]:
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
Upvotes: 2
Reputation: 42916
Using my fuzzy_merge
function:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df3 = fuzzy_merge(df2, df1, 'Account', 'Account', threshold=80)\
.merge(df1, left_on='matches', right_on='Account', suffixes=['', '_2'])\
.drop(columns=['matches', 'Account_2'])
Output
Account Col_B Col_C ID Name
0 B36363-0 text_b1 text_c1 2019001 John
1 01_G47281 text_b2 text_c2 2019002;2018101 Alice;Emma
2 II_G47281 text_b4 text_C4 2019002;2018101 Alice;Emma
3 X_H46291 text_b3 text_c3 2019001 John
Fuzzy_merge
function from linked answer:
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
"""
df_1 is the left table to join
df_2 is the right table to join
key1 is the key column of the left table
key2 is the key column of the right table
threshold is how close the matches should be to return a match
limit is the amount of matches will get returned, these are sorted high to low
"""
s = df_2[key2].tolist()
m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
df_1['matches'] = m
m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
df_1['matches'] = m2
return df_1
Upvotes: 2