Reputation: 541
I am having some trouble with UTF-8 encoding and decoding. The following function should replace multiple words in the given data. The background is that I am parsing multiple PDF documents and trying to find certain keywords for further use. Because parsing a PDF to a string leads to misspellings, this function works around them. I shrank the function significantly; normally there are more replacements, more types, and so on, but the main problem already shows up in this small part.
While replaceSimilar1() works perfectly fine, replaceSimilar2() does not replace the words the way I want it to. The txt documents hold exactly the same entries as the arrays and are saved in UTF-8. I assume it has to do with encoding or decoding some of the parts, but nothing I have tried so far has worked. No exception is raised; it just doesn't replace the given words.
Here is my code (including a main for testing):
# -*- coding: utf-8 -*-
import codecs

RESOURCE_PATH="resource"

def replaceSimilar1(data, type):
    ZMP_array=["zählpunkt:", "lpunkt:", "zmp:", "hipunkt:", "h punkt:", "htpunkt:", ]
    adress_array=["zirkusweg", "zirktisweg", "rkusweg", "zirnusweg", "jürgen-töpfer", "-töpfer", "jürgen-", "pfer-stras", "jürgentöpfer", "ürgenlöpfer"]
    if type=="adress":
        array=adress_array
    elif type=="zmp":
        array=ZMP_array
    else:
        array=["",""]
    for word in array:
        data=data.lower().replace(word, type)
    return data

def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file.readlines():
        data=data.lower().replace(line.encode("utf-8"), type)
    c_file.close()
    return data

if __name__=="__main__":
    testString="this lpunkt: should be replaced by zmp as well as -töpfer should be replaced by adress..."
    print("testString: "+testString)
    #PART 1:
    replaced1=replaceSimilar1(testString, "zmp")
    replaced1=replaceSimilar1(replaced1, "adress")
    print("replaced 1: "+replaced1)
    # PART 2:
    replaced2=replaceSimilar2(testString, "zmp")
    replaced2=replaceSimilar2(replaced2, "adress")
    print("replaced 2: "+replaced2)
Upvotes: 1
Views: 1348
Reputation: 1276
The problem is not the encoding, but the fact that each line you read from the file ends with a newline character (\n), so it never matches the text in data. Use line.strip() instead of line, changing the function to:
def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file:
        data=data.lower().replace(line.strip().encode("utf-8"), type)
    c_file.close()
    return data
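
For anyone who wants to try the newline fix without setting up the resource folder, here is a minimal self-contained sketch. It is not the code from the question: the names replace_similar, resource_dir and demo are made up for illustration, the word lists are shortened, and it assumes Python 3, where the file is read as text and no encode() call is needed. It also skips blank lines, since replacing an empty string would insert the keyword between every character.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def replace_similar(data, keyword, resource_dir):
    """Replace every variant listed in <resource_dir>/<keyword>.txt by keyword.

    Hypothetical helper for illustration; assumes one variant per line, UTF-8.
    """
    path = os.path.join(resource_dir, keyword + ".txt")
    data = data.lower()
    with io.open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            variant = line.strip()  # drop the trailing "\n" (and stray spaces)
            if variant:  # skip blank lines; replacing "" would match everywhere
                data = data.replace(variant, keyword)
    return data


def demo():
    """Write two small word lists to a temp folder and run both replacements."""
    resource_dir = tempfile.mkdtemp()
    with io.open(os.path.join(resource_dir, "zmp.txt"), "w", encoding="utf-8") as handle:
        handle.write(u"zählpunkt:\nlpunkt:\n")
    with io.open(os.path.join(resource_dir, "adress.txt"), "w", encoding="utf-8") as handle:
        handle.write(u"jürgen-töpfer\n-töpfer\n")

    text = u"this lpunkt: should be replaced by zmp as well as -töpfer should be replaced by adress..."
    text = replace_similar(text, "zmp", resource_dir)
    return replace_similar(text, "adress", resource_dir)


if __name__ == "__main__":
    print(demo())
```

Building the path with os.path.join instead of a hard-coded "\\" also keeps the function usable outside Windows.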
Upvotes: 1