Hama Latha
Hama Latha

Reputation: 69

Find and Replace in DataFrame using Pandas in optimized way

I am trying to find and replace words from the 20K comments. Find and replace words are stored in dataframe and its around more than 20000. Comments in different dataframe and its around 20K.

Below is the example

import pandas as pd

df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})

df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
                    'Replace' : ["Insurance","Blocked"],
                       }) 

And I am expecting the output below

op = ["Hull Damage happened and its insured by maritime hull insurance company","Blocked and claims are blocked"]})

Please help.

I am using loop but its taking more than 20 mins to do this. 20 k records in the data, 30000 words to be replaced

""KeywordSynonym"" -- Dataframe holds find and replace data in sql
""backup"" -- Dataframe hold data to be cleaned

backup = str(backup)
TrainingClaimNotes_KwdSyn = []
for index,row in KeywordSynonym.iterrows():
    word = KeywordSynonym.Synonym[index].lower()
    value = KeywordSynonym.Keyword[index].lower()
    my_regex = r"\b(?=\w)" + re.escape(word) + r"\b(?!\w)" 
    if re.search(my_regex,backup):
        backup = re.sub(my_regex, value, backup) 
    TrainingClaimNotes_KwdSyn.append(backup)

TrainingClaimNotes_KwdSyn_Cmp = backup.split('\'", "\'') 

Upvotes: 2

Views: 268

Answers (1)

Vivek Kalyanarangan
Vivek Kalyanarangan

Reputation: 9081

Use:

import pandas as pd

df1 = pd.DataFrame({'Data' : ["Hull Damage happened and its insured by maritime hull insurence company","Non Cash Entry and claims are blocked"]})

df2 = pd.DataFrame({ 'Find' : ["Insurence","Non cash entry"],
                    'Replace' : ["Insurance","Blocked"],
                       }) 

find_repl = dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower()))
d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}

df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)

Output

>>> print(df1['Data_1'].tolist())
['hull damage happened and its insured by maritime hull insurance company', 'blocked and claims are blocked']

Explanation

dict(zip(df2['Find'].str.lower(), df2['Replace'].str.lower())) creates a mapping between what you want to replace and the string you want to replace with -

{'insurence': 'insurance', 'non cash entry': 'blocked'}

Convert the lookups to regex making it ready for lookup -

d2 = {r'(\b){}(\b)'.format(k):r'\1{}\2'.format(v) for k,v in find_repl.items()}

{'(\\b)insurence(\\b)': '\\1insurance\\2', '(\\b)non cash entry(\\b)': '\\1blocked\\2'}

The final piece is just making the actual replacement -

df1['Data_1'] = df1['Data'].str.lower().replace(d2, regex=True)

Note: I did a .lower() everywhere to find proper matches. Obviously you can reshape it to the way you want it to look.

Upvotes: 1

Related Questions