Reputation: 69
I am trying to find and replace words from the 20K comments. Find and replace words are stored in dataframe and its around more than 20000. Comments in different dataframe and its around 20K.
Below is the example
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
And I am expecting the output below
op = ["Hull Damage happened and its insured by maritime hull insurance company","Blocked and claims are blocked"]})
Please help.
I am using loop but its taking more than 20 mins to do this. 20 k records in the data, 30000 words to be replaced
""KeywordSynonym"" -- Dataframe holds find and replace data in sql
""backup"" -- Dataframe hold data to be cleaned
backup = str(backup)
TrainingClaimNotes_KwdSyn = []
for index,row in KeywordSynonym.iterrows():
word = KeywordSynonym.Synonym[index].lower()
value = KeywordSynonym.Keyword[index].lower()
my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)"
if re.search(my_regex,backup):
backup = re.sub(my_regex, value, backup)
TrainingClaimNotes_KwdSyn.append(backup)
TrainingClaimNotes_KwdSyn_Cmp = backup.split('\'", "\'')
Upvotes: 2
Views: 268
Reputation: 9081
Use:
import pandas as pd
df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})
df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
'Replace' : ["Insurance","Blocked"],
})
find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Output
>>> print(df1['Data_1'].tolist())
['hull damage happened and its insured by maritime hull insurance company', 'blocked and claims are blocked']
Explanation
dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
creates a mapping between what you want to replace and the string you want to replace with -
{'insurence': 'insurance', 'non cash entry': 'blocked'}
Convert the lookups to regex
making it ready for lookup -
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}
{'(\\b)insurence(\\b)': '\\1insurance\\2', '(\\b)non cash entry(\\b)': '\\1blocked\\2'}
The final piece is just making the actual replacement -
df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)
Note: I did a .lower()
everywhere to find proper matches. Obviously you can reshape it to the way you want it to look.
Upvotes: 1