Caiotru
Caiotru

Reputation: 325

Pandas / Python - Merge dataframes where the key is located in 2 sub-strings

I have been asking this question quite few times and it seems that no one can answer it...

I am looking for a loop/fuction or a simple code that can look through 2 columns in different dataframes and output a third column. This example is quite different from a simple merge or a merge where we have one string and one substring... in this example we have 2 substrings to compare and output a third column if one of the key stored in the substring line is present in in the other substring line of the diffrent dataframe.

This is the example:

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])
df

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

enter image description here

What i need is to find where part of both codes are the same like 999999 or 0000 and output the SecondName

The expected output:

enter image description here

I have done my reserach and I found a way to locate a substring from a string but not from another substring like in my case.

Upvotes: 1

Views: 167

Answers (1)

Serial Lazer
Serial Lazer

Reputation: 1669

You need to split the codes and concat all possible combinations of merged-results.

Here is the working code:

import pandas as pd
import numpy as np

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

df[['c1', 'c2']] = df.Code.str.split(" ", expand=True)
df1[['c1', 'c2']] = df1.Code2.str.split(" ", expand=True)

rdf = pd.DataFrame()
for col1 in ['c1', 'c2']:
    for col2 in ['c1', 'c2']:
        rdf = pd.concat([rdf, df.merge(df1, left_on=[col1], right_on=[col2], how='inner')], axis=0)

rdf = df.merge(rdf[['Name', 'SecondName']], on='Name', how='outer')
print(rdf[['Name', 'SecondName']])

Output:

     Name SecondName
0    Alex        NaN
1     Bob       Reed
2  Clarke         Jo

Upvotes: 2

Related Questions