Reputation: 6782
I have two data-frames like :
df1 :
df2 :
I am trying make a match of any term to text.
MyCode :
import sys,os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import csv
import re
# data
data1 = {'termID': [1,55,341,41,5685], 'term':['Cardic Arrest','Headache','Chest Pain','Muscle Pain', 'Knee Pain']}
data2 = {'textID': [25,12,52,35], 'text':['Hello Mike, Good Morning!!',
'Oops!! My Knee pains!!',
'Stop Music!! my head pains',
'Arrest Innocent!!'
]}
#Dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Matching logic
matchList=[]
for index_b, row_b in df2.iterrows():
for index_a, row_a in df1.iterrows():
if row_a.term.lower() in row_b.text.lower() :
#print(row_b.text, row_a.term)
matchList.append([row_b.textID,row_b.text ,row_a.term, row_a.termID] )
cols = ['textID', 'text,','term ','termID' ]
d = pd.DataFrame(matchList, columns = cols)
print(d)
Which gave me only single row as output :
I have two issues to fix:
What optimum ways are there to fix these two issues?
Upvotes: 0
Views: 361
Reputation: 474
I've a quick fix for problem 1 but not an optimisation. You only get one match because "Knee pain" is the only term that appears in full in df1. I've modified the if statement to split the text from df2 and check if there are any matches from the list. Agree with @jakub that there are libraries that will do this quicker.
# Matching logic
matchList=[]
for index_b, row_b in df2.iterrows():
print(row_b)
for index_a, row_a in df1.iterrows():
if any(word in row_a.term.lower() for word in row_b.text.lower().split()):
#print(row_b.text, row_a.term)
matchList.append([row_b.textID,row_b.text ,row_a.term, row_a.termID] )
cols = ['textID', 'text,','term ','termID' ]
d = pd.DataFrame(matchList, columns = cols)
print(d)
Output
textID text, term termID
0 12 Oops!! My Knee pains!! Knee Pain 5685
1 52 Stop Music!! my head pains Headache 55
2 35 Arrest Innocent!! Cardic Arrest 1
Upvotes: 1