lukin
lukin

Reputation: 65

How do I join two dataframes based on a dataframe's row value being a substring/contained in another dataframe's row value?

I have the following two dataframes -

DataFrame - A

enter image description here

DataFrame - B

enter image description here

Now, I want to perform a join on both DataFrames on the columns 'Code - A' and 'Code - B' such that the join is performed if the value in 'Code - A' is a substring of or contains the value in 'Code - B'. A sample output for this example would be as follows -

enter image description here

The values in 'Code - A' and 'Code - B' are of type int, which I am converting to a string for easier manipulation. And, so far, I have implemented the following code:

data_A = [['A', 100010], ['B', 200010]] 
A = pd.DataFrame(data_A, columns = ['Type - A', 'Code - A'])
data_B = [['AA', 1000], ['BB', 10001020],['CC',10001030],['DD',10001040],['EE',20001020],['FF',20001030],['GG',200010]] 
B = pd.DataFrame(data_B, columns = ['Type - B', 'Code - B'])
A['Code - A'] = A['Code - A'].astype(str)
B['Code - B'] = B['Code - B'].astype(str)  
A['join'] = 1 #creating a dummy column in Dataframe A called join
B['join'] = 1 #creating a dummy column in Dataframe B called join
cartesian = A.merge(B, on='join').drop('join', axis=1) #getting the cartesian product
cartesian['Match'] = cartesian.apply(lambda x: x['Code - A'].find(x['Code - B']), axis=1).ge(0) #checking for columns where 'Code - A' has a string which is either a substring of 'Code - B' or one where 'Code - B' is a substring of 'Code - A'
mask = cartesian['Match'] == True #filtering to see the matched items
full = cartesian.loc[mask]

However, this only returns the following Dataframe.

enter image description here

How should I modify my code to get the aforementioned output?

Thanks!

Upvotes: 0

Views: 44

Answers (2)

jschnurr
jschnurr

Reputation: 1191

After you create the cartesian dataframe, you can query it like so:

cartesian[cartesian.apply(lambda row: row['Code - B'] in row['Code - A'] or row['Code - A'] \
  in row['Code - B'], axis = 1)]

Output:

   Type - A Code - A Type - B  Code - B
0         A   100010       AA      1000
1         A   100010       BB  10001020
2         A   100010       CC  10001030
3         A   100010       DD  10001040
11        B   200010       EE  20001020
12        B   200010       FF  20001030
13        B   200010       GG    200010

Upvotes: 2

jlb_gouveia
jlb_gouveia

Reputation: 603

Looking into the codes, it seems that only the first 4 digits matter, so I created a new column on each dataframe with the first 4 digits called 'key',then merged the two dataframes:

#Creating a key for each dataframe:
for i in A.index:
    A.loc[i,'key'] = str(A.loc[i,'Code - A'])[:4]

for i in B.index:
    B.loc[i,'key'] = str(B.loc[i,'Code - B'])[:4]

#merging the dataframes
A.merge(B)

Output:

    Type - A    Code - A    key      Type - B   Code - B
0   A           100010      1000     AA         1000
1   A           100010      1000     BB         10001020
2   A           100010      1000     CC         10001030
3   A           100010      1000     DD         10001040
4   B           200010      2000     EE         20001020
5   B           200010      2000     FF         20001030
6   B           200010      2000     GG         200010

Upvotes: 1

Related Questions