Marco Canora

Reputation: 335

Can't replace Unicode chars read from a file with others in a text

I'm trying to write a parser in Python 3 that replaces special Spanish characters that don't appear in the English alphabet. To do this, I have a CSV text file with all the transformations (encoded in UTF-8):

\u00c1,\u0041
\u00c9,\u0045
...
\u00fc,\u0075

But when I run the parser, it doesn't do anything. On the other hand, if I do this, it works perfectly:

text.replace('\u00c1', '\u0041')

Here is the code:

#!/usr/bin/env python3

from csv import reader

class Parser():

    def __init__(self, lang):
        self.lang = lang

    def replace(self, text):
        with open('./data/{}/replace.csv'.format(self.lang), 'r') as file:
            csvreader = reader(file)
            for l in csvreader:
                # text = text.replace('\u00f1','\u006e') This works
                text = text.replace(l[0],l[1])
        return text

def main():
    myparser = Parser('spanish')
    with open('/home/marco/Escritorio/ejemplo.txt', 'r') as file:
        text = file.read()
        print(myparser.replace(text))

if __name__ == '__main__':
    main()

Upvotes: 0

Views: 164

Answers (2)

Mark Tolonen

Reputation: 177991

Another way is to decompose the original characters into their unaccented base character and a combining accent mark. Next, encode to ASCII ignoring errors, which removes all the non-ASCII accent marks. Decode again to get back to Unicode (str) if needed.

>>> import unicodedata
>>> s = 'áéíóúüñ'
>>> unicodedata.normalize('NFD',s)
'a\u0301e\u0301i\u0301o\u0301u\u0301u\u0308n\u0303'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore')
b'aeiouun'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'aeiouun'
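
This removes the need for the CSV file altogether. Below is a minimal sketch of how it might be folded into the question's Parser class; it's just an illustration, not code from the question, and note that encoding to ASCII also drops any other non-ASCII characters (e.g. '¿'), not only accent marks.

import unicodedata

class Parser():

    def __init__(self, lang):
        self.lang = lang

    def replace(self, text):
        # Split accented characters into base character + combining accent mark,
        # drop everything that isn't ASCII (the combining marks), then
        # return the result as a str again.
        decomposed = unicodedata.normalize('NFD', text)
        return decomposed.encode('ascii', errors='ignore').decode('ascii')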

Upvotes: 0

mhawke

Reputation: 87134

Open the CSV file in binary mode and then convert each line from "escaped unicode", e.g. '\\u00c1', to Unicode (type str in Python 3) before the CSV reader gets its hands on the data:

def replace(self, text):
    with open('./data/{}/replace.csv'.format(self.lang), 'rb') as f:
        csvreader = reader(line.decode('unicode_escape') for line in f)
        for l in csvreader:
            text = text.replace(l[0], l[1])
    return text

Using bytes.decode('unicode_escape') decodes the incoming data from escaped Unicode into the actual Unicode characters. The decoding is memory efficient because it uses a generator expression, which avoids reading the entire CSV into memory. Once that is done, the csv module handles the data as Unicode strings, and the string replacement should work as you expect.
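
As a quick check of what the codec does, here is a row like the ones in the question's CSV (sample bytes chosen for illustration) decoded by hand:

>>> b'\\u00c1,\\u0041'.decode('unicode_escape')
'Á,A'

After decoding, the CSV reader sees the real characters, so l[0] is 'Á' and l[1] is 'A', and text.replace(l[0], l[1]) finds matches as expected.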

Upvotes: 1
