Reputation: 65
I have the following two dataframes -
DataFrame - A
DataFrame - B
Now, I want to perform a join on both DataFrames on the columns 'Code - A' and 'Code - B' such that the join is performed if the value in 'Code - A' is a substring of or contains the value in 'Code - B'. A sample output for this example would be as follows -
The values in 'Code - A' and 'Code - B' are of type int, which I am converting to a string for easier manipulation. And, so far, I have implemented the following code:
data_A = [['A', 100010], ['B', 200010]]
A = pd.DataFrame(data_A, columns = ['Type - A', 'Code - A'])
data_B = [['AA', 1000], ['BB', 10001020],['CC',10001030],['DD',10001040],['EE',20001020],['FF',20001030],['GG',200010]]
B = pd.DataFrame(data_B, columns = ['Type - B', 'Code - B'])
A['Code - A'] = A['Code - A'].astype(str)
B['Code - B'] = B['Code - B'].astype(str)
A['join'] = 1 #creating a dummy column in Dataframe A called join
B['join'] = 1 #creating a dummy column in Dataframe B called join
cartesian = A.merge(B, on='join').drop('join', axis=1) #getting the cartesian product
cartesian['Match'] = cartesian.apply(lambda x: x['Code - A'].find(x['Code - B']), axis=1).ge(0) #checking for columns where 'Code - A' has a string which is either a substring of 'Code - B' or one where 'Code - B' is a substring of 'Code - A'
mask = cartesian['Match'] == True #filtering to see the matched items
full = cartesian.loc[mask]
However, this only returns the following Dataframe.
How should I modify my code to get the aforementioned output?
Thanks!
Upvotes: 0
Views: 44
Reputation: 1191
After you create the cartesian
dataframe, you can query it like so:
cartesian[cartesian.apply(lambda row: row['Code - B'] in row['Code - A'] or row['Code - A'] \
in row['Code - B'], axis = 1)]
Output:
Type - A Code - A Type - B Code - B
0 A 100010 AA 1000
1 A 100010 BB 10001020
2 A 100010 CC 10001030
3 A 100010 DD 10001040
11 B 200010 EE 20001020
12 B 200010 FF 20001030
13 B 200010 GG 200010
Upvotes: 2
Reputation: 603
Looking into the codes, it seems that only the first 4 digits matter, so I created a new column on each dataframe with the first 4 digits called 'key',then merged the two dataframes:
#Creating a key for each dataframe:
for i in A.index:
A.loc[i,'key'] = str(A.loc[i,'Code - A'])[:4]
for i in B.index:
B.loc[i,'key'] = str(B.loc[i,'Code - B'])[:4]
#merging the dataframes
A.merge(B)
Output:
Type - A Code - A key Type - B Code - B
0 A 100010 1000 AA 1000
1 A 100010 1000 BB 10001020
2 A 100010 1000 CC 10001030
3 A 100010 1000 DD 10001040
4 B 200010 2000 EE 20001020
5 B 200010 2000 FF 20001030
6 B 200010 2000 GG 200010
Upvotes: 1