Reputation: 15
I have a dataframe that contain 4 columns, one of them is called action_description, it contain "free text" that resume the different actions done to resolve an issue.
The words in this column are sometimes miswritting, and we have a dictionnary, for all famous miswords (example: REPLCD -> REPLACED, .....)
I want to replace all miswords in my column using python code.
Here's the code I use:
Code:
import sys
import pyspark
import pandas_datareader
import re
import csv
import xlrd
import pandas as pd
import numpy as np
import datetime
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pandas import DataFrame
from pandas_datareader import data, wb
from pandas import *
xls = ExcelFile("test_doc_2.xls")
df = xls.parse(xls.sheet_names[0])
df.drop(df.columns[[0, 1]],inplace=True,axis=1)
df2 = Series(df.TO_VALUE.values,index=df.FROM_VALUE).to_dict()
xls1 = ExcelFile("Test_Source_New_2.xls")
df1 = xls1.parse(xls1.sheet_names[0])
df1['WORK_PERFORMED_NEW'] = df1['WORK_PERFORMED'].replace(df2, regex=True)
This solution work, except in some cases,
in my dictionary: DEF -> DEFERRED, DEFERED -> DEFERRED
so with my solution: DEFERED -> DEFERREDERED, at it replaced DEF in DEFERED by DEFERRED and it got concatenated with ERED, DEFERRED+ERED.
I thought about using boundaries (r"\b"), but I got syntax error !!!
How can I overcome this issue .
Thank you in advance.
Upvotes: 0
Views: 262
Reputation: 3580
I guess the issue you are facing is due to regex = True. As you mentioned you have a dictionary:
DEF -> DEFERRED, DEFERED -> DEFERRED
So,when you pass DEFERED , it first finds DEF and replaces it with DEFERRED in place followed by ERED. So you will get:
DEFERED -> DEFERREDERED
Simplified:
DEF +ERED -> DEFERRED + ERED -> DEFFEREDERED
In case of any query, you can comment.
Upvotes: 1