Shivkumar kondi
Shivkumar kondi

Reputation: 6782

Partial word match between two columns of different pandas dataframes

I have two data-frames like :

df1 :

enter image description here

df2 :

enter image description here

I am trying make a match of any term to text.

MyCode :

import sys,os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import csv
import re

# data
data1 = {'termID': [1,55,341,41,5685], 'term':['Cardic Arrest','Headache','Chest Pain','Muscle Pain', 'Knee Pain']}
data2 = {'textID': [25,12,52,35], 'text':['Hello Mike, Good Morning!!',
                                         'Oops!! My Knee pains!!',
                                          'Stop Music!! my head pains',
                                          'Arrest Innocent!!'
                                         ]}

#Dataframes 
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Matching logic
matchList=[]
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        if  row_a.term.lower() in row_b.text.lower() :   
            #print(row_b.text, row_a.term)
            matchList.append([row_b.textID,row_b.text ,row_a.term, row_a.termID] )

cols = ['textID', 'text,','term ','termID' ]
d = pd.DataFrame(matchList, columns = cols)
print(d)

Which gave me only single row as output :

enter image description here


I have two issues to fix:

  1. Not sure how can I get output for any partial match like this :

enter image description here

  1. Both DF1 and DF2 are of size of around 0.4M and 13M records.

What optimum ways are there to fix these two issues?

Upvotes: 0

Views: 361

Answers (1)

DDaly
DDaly

Reputation: 474

I've a quick fix for problem 1 but not an optimisation. You only get one match because "Knee pain" is the only term that appears in full in df1. I've modified the if statement to split the text from df2 and check if there are any matches from the list. Agree with @jakub that there are libraries that will do this quicker.

# Matching logic
matchList=[]
for index_b, row_b in df2.iterrows():
    print(row_b)
    for index_a, row_a in df1.iterrows():
        if  any(word in row_a.term.lower() for word in row_b.text.lower().split()):
            #print(row_b.text, row_a.term)
            matchList.append([row_b.textID,row_b.text ,row_a.term, row_a.termID] )

cols = ['textID', 'text,','term ','termID' ]
d = pd.DataFrame(matchList, columns = cols)
print(d)

Output

   textID                       text,          term   termID
0      12      Oops!! My Knee pains!!      Knee Pain    5685
1      52  Stop Music!! my head pains       Headache      55
2      35           Arrest Innocent!!  Cardic Arrest       1

Upvotes: 1

Related Questions