Python - replace() and utf-8 encoding / decoding using .txt files with codecs

Question

I am having trouble with UTF-8 encoding and decoding. The following function should replace multiple words in the given data. The background is that I am parsing several PDF documents and trying to find certain keywords for further use. Because parsing a PDF to a string introduces misspellings, this function works around them. I shrank the function significantly; normally there are more replacements and more types, but the main problem already shows up in this small part.

While replaceSimilar1() works perfectly fine, replaceSimilar2() does not replace the words the way I want it to. The .txt files hold exactly the same entries as the arrays and are saved as UTF-8. I think it has to do with encoding or decoding some of the parts, but nothing I have tried so far has worked. No exception is raised; the given words just aren't replaced.

Here is my code (including a main block for testing):

# -*- coding: utf-8 -*-
import codecs


RESOURCE_PATH="resource"


def replaceSimilar1(data, type):

    ZMP_array=["zählpunkt:", "lpunkt:", "zmp:", "hipunkt:", "h punkt:", "htpunkt:", ]
    adress_array=["zirkusweg", "zirktisweg", "rkusweg", "zirnusweg", "jürgen-töpfer", "-töpfer", "jürgen-", "pfer-stras", "jürgentöpfer", "ürgenlöpfer"]

    if type=="adress":
        array=adress_array

    elif type=="zmp":
        array=ZMP_array

    else:
        array=["",""]

    for word in array:
        data=data.lower().replace(word, type)

    return data


def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file.readlines():
        data=data.lower().replace(line.encode("utf-8"), type)
    c_file.close()
    return data


if __name__=="__main__":

    testString="this lpunkt: should be replaced by zmp as well as -töpfer should be replaced by adress..."
    print("testString: "+testString)

    #PART 1:
    replaced1=replaceSimilar1(testString, "zmp")
    replaced1=replaceSimilar1(replaced1, "adress")
    print("replaced 1: "+replaced1)

    # PART 2:
    replaced2=replaceSimilar2(testString, "zmp")
    replaced2=replaceSimilar2(replaced2, "adress")
    print("replaced 2: "+replaced2)

vad · Accepted Answer

The problem is not the encoding. When you read the file, each line ends with a newline character (\n), so the search string never matches anything in the text. Use line.strip() instead, changing the function to:

def replaceSimilar2(data, type):
    c_file=codecs.open(RESOURCE_PATH+"\\"+type+".txt", "r+", "utf-8")
    for line in c_file:
        data=data.lower().replace(line.strip().encode("utf-8"), type)
    c_file.close()
    return data
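
For comparison, here is a minimal sketch of the same fix assuming Python 3, where the built-in open() decodes the file to str and no manual encode() call is needed. The function name replace_similar, the resource_dir parameter, and the temporary files are made up for this example and are not part of the original code. It also shows why the unstripped line never matches.

```python
import os
import tempfile


def replace_similar(data, keyword, resource_dir):
    """Replace every variant listed in <resource_dir>/<keyword>.txt with keyword."""
    path = os.path.join(resource_dir, keyword + ".txt")
    data = data.lower()
    # open() with encoding= returns str lines, so nothing has to be encoded by hand.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            variant = line.strip()  # drop the trailing "\n" and surrounding whitespace
            if variant:  # skip blank lines; replacing "" would insert keyword everywhere
                data = data.replace(variant, keyword)
    return data


# Demonstration with a temporary resource directory (file contents are illustrative).
with tempfile.TemporaryDirectory() as resource_dir:
    with open(os.path.join(resource_dir, "zmp.txt"), "w", encoding="utf-8") as f:
        f.write("zählpunkt:\nlpunkt:\n")
    with open(os.path.join(resource_dir, "adress.txt"), "w", encoding="utf-8") as f:
        f.write("jürgen-töpfer\n-töpfer\n")

    text = "this lpunkt: should be replaced as well as -töpfer"

    # A line read from the file looks like "lpunkt:\n", which does not occur in text,
    # so replace() silently leaves the string unchanged.
    unstripped = text.replace("lpunkt:\n", "zmp")

    result = replace_similar(text, "zmp", resource_dir)
    result = replace_similar(result, "adress", resource_dir)
    print(result)
```

Using os.path.join instead of a hard-coded backslash keeps the path portable, and the with block closes the file even if an error occurs.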
