Replace keywords in dataframe column using pandas dictionary

Question

I have a dataframe that contain 4 columns, one of them is called action_description, it contain "free text" that resume the different actions done to resolve an issue.

The words in this column are sometimes miswritting, and we have a dictionnary, for all famous miswords (example: REPLCD -> REPLACED, .....)

I want to replace all miswords in my column using python code.

Here's the code I use:

Code:

import sys
import pyspark
import pandas_datareader
import re
import csv
import xlrd
import pandas as pd
import numpy as np
import datetime

from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pandas import DataFrame
from pandas_datareader import data, wb
from pandas import *

xls = ExcelFile("test_doc_2.xls")
df = xls.parse(xls.sheet_names[0])
df.drop(df.columns[[0, 1]],inplace=True,axis=1)
df2 = Series(df.TO_VALUE.values,index=df.FROM_VALUE).to_dict()

xls1 = ExcelFile("Test_Source_New_2.xls")
df1 = xls1.parse(xls1.sheet_names[0])

df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df2, regex=True)

This solution work, except in some cases,

in my dictionary: DEF -> DEFERRED, DEFERED -> DEFERRED

so with my solution: DEFERED -> DEFERREDERED, at it replaced DEF in DEFERED by DEFERRED and it got concatenated with ERED, DEFERRED+ERED.

I thought about using boundaries (r"\b"), but I got syntax error !!!

How can I overcome this issue .

Thank you in advance.

Sumit S Chawla · Accepted Answer

I guess the issue you are facing is due to regex = True. As you mentioned you have a dictionary:

DEF -> DEFERRED, DEFERED -> DEFERRED

So,when you pass DEFERED , it first finds DEF and replaces it with DEFERRED in place followed by ERED. So you will get:

DEFERED -> DEFERREDERED

Simplified:

DEF +ERED -> DEFERRED + ERED -> DEFFEREDERED

In case of any query, you can comment.

Replace keywords in dataframe column using pandas dictionary

Answers (1)

Related Questions