Marco Canora

Reputation: 335

Can't replace Unicode chars read from a file with others in a text

I'm trying to write a parser in Python 3 that replaces special Spanish characters that don't appear in the English alphabet. To do this, I have a CSV text file with all the transformations (encoded in UTF-8):

\u00c1,\u0041
\u00c9,\u0045
...
\u00fc,\u0075

But when I run the parser, it doesn't do anything. On the other hand, if I do this, it works perfectly:

text.replace('\u00c1', '\u0041')

Here is the code:

#!/usr/bin/env python3

from csv import reader

class Parser():

    def __init__(self, lang):
        self.lang = lang

    def replace(self, text):
        with open('./data/{}/replace.csv'.format(self.lang), 'r') as file:
            csvreader = reader(file)
            for l in csvreader:
                # text = text.replace('\u00f1','\u006e') This works
                text = text.replace(l[0],l[1])
        return text

def main():
    myparser = Parser('spanish')
    with open('/home/marco/Escritorio/ejemplo.txt', 'r') as file:
        text = file.read()
        print(myparser.replace(text))

if __name__ == '__main__':
    main()

Upvotes: 0

Views: 164

Answers (2)

Mark Tolonen

Reputation: 177991

Another way is to decompose the original characters into their unaccented base character and a combining accent mark. Next, encode to ASCII ignoring errors, which removes all the non-ASCII accent marks. Decode again to get back to Unicode (str) if needed.

>>> import unicodedata
>>> s = 'áéíóúüñ'
>>> unicodedata.normalize('NFD',s)
'a\u0301e\u0301i\u0301o\u0301u\u0301u\u0308n\u0303'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore')
b'aeiouun'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'aeiouun'
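
This removes the need for the CSV file altogether. Below is a minimal sketch of how it might be folded into the question's Parser class; it's just an illustration, not code from the question, and note that encoding to ASCII also drops any other non-ASCII characters (e.g. '¿'), not only accent marks.

import unicodedata

class Parser():

    def __init__(self, lang):
        self.lang = lang

    def replace(self, text):
        # Split accented characters into base character + combining accent mark,
        # drop everything that isn't ASCII (the combining marks), then
        # return the result as a str again.
        decomposed = unicodedata.normalize('NFD', text)
        return decomposed.encode('ascii', errors='ignore').decode('ascii')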

Upvotes: 0

mhawke

Reputation: 87134

Open the CSV file in binary mode and then convert each line from "escaped unicode", e.g. '\\u00c1', to Unicode (type str in Python 3) before the CSV reader gets its hands on the data:

def replace(self, text):
    with open('./data/{}/replace.csv'.format(self.lang), 'rb') as f:
        csvreader = reader(line.decode('unicode_escape') for line in f)
        for l in csvreader:
            text = text.replace(l[0], l[1])
    return text

Using bytes.decode('unicode_escape') decodes the incoming data from escaped Unicode into the actual Unicode characters. The decoding is memory efficient because it uses a generator expression, which avoids reading the entire CSV into memory. Once that is done, the csv module handles the data as Unicode strings, and the string replacement should work as you expect.
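
As a quick check of what the codec does, here is a row like the ones in the question's CSV (sample bytes chosen for illustration) decoded by hand:

>>> b'\\u00c1,\\u0041'.decode('unicode_escape')
'Á,A'

After decoding, the CSV reader sees the real characters, so l[0] is 'Á' and l[1] is 'A', and text.replace(l[0], l[1]) finds matches as expected.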

Upvotes: 1
