Reputation: 335
I'm trying to write a parser in Python 3 that replaces special Spanish characters that don't appear in the English alphabet. To do this, I have a CSV text file with all the transformations (encoded in UTF-8):
\u00c1,\u0041
\u00c9,\u0045
...
\u00fc,\u0075
But when I run the parser, it doesn't do anything. On the other hand, if I do this, it works perfectly:
text.replace('\u00c1', '\u0041')
Here is the code:
#!/usr/bin/env python3
from csv import reader

class Parser():
    def __init__(self, lang):
        self.lang = lang

    def replace(self, text):
        with open('./data/{}/replace.csv'.format(self.lang), 'r') as file:
            csvreader = reader(file)
            for l in csvreader:
                # text = text.replace('\u00f1','\u006e') This works
                text = text.replace(l[0],l[1])
        return text

def main():
    myparser = Parser('spanish')
    with open('/home/marco/Escritorio/ejemplo.txt', 'r') as file:
        text = file.read()
    print(myparser.replace(text))

if __name__ == '__main__':
    main()
Upvotes: 0
Views: 164
Reputation: 177991
Another way is to decompose each character of the original text into its unaccented base character plus a separate combining accent mark (Unicode NFD normalization). Then encode to ASCII, ignoring errors, which drops all the non-ASCII accent marks. Decode it again to get back a Unicode str if needed.
>>> import unicodedata
>>> s = 'áéíóúüñ'
>>> unicodedata.normalize('NFD',s)
'a\u0301e\u0301i\u0301o\u0301u\u0301u\u0308n\u0303'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore')
b'aeiouun'
>>> unicodedata.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'aeiouun'
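The steps above can be wrapped in a small helper. This is only a sketch: the function name `strip_accents` is hypothetical, not something from the question or from `unicodedata`. Note that encoding with `errors='ignore'` also drops any other non-ASCII character that has no ASCII base letter, such as `¿` or `¡`, so check whether that is acceptable for your text.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: return text with accent marks removed."""
    # NFD splits 'á' into 'a' followed by the combining mark U+0301.
    decomposed = unicodedata.normalize('NFD', text)
    # Encoding to ASCII with errors='ignore' silently drops every
    # non-ASCII character, which includes the combining marks
    # (and also characters like '¿' that have no ASCII equivalent).
    return decomposed.encode('ascii', errors='ignore').decode('ascii')

print(strip_accents('áéíóúüñ'))  # aeiouun
```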
Upvotes: 0
Reputation: 87134
Open the CSV file in binary mode and convert each line from "escaped Unicode", e.g. '\\u00c1', to Unicode (type str in Python 3) before the CSV reader gets its hands on the data:
def replace(self, text):
    with open('./data/{}/replace.csv'.format(self.lang), 'rb') as f:
        csvreader = reader(line.decode('unicode_escape') for line in f)
        for l in csvreader:
            text = text.replace(l[0], l[1])
    return text
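Calling bytes.decode('unicode_escape') on each line converts the escape sequences into the actual Unicode characters. This matters because the file contains the six literal characters \u00c1 rather than the character Á; Python only interprets \u escapes in string literals in source code, which is why the hard-coded text.replace('\u00c1', '\u0041') works. The decoding is memory efficient because the generator expression feeds the reader one line at a time instead of reading the entire CSV into memory. After that, the csv module handles the data as Unicode, and the string replacement should work as you expect.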
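To see the difference without touching the real file, here is a minimal sketch that uses an in-memory stand-in for replace.csv. The two sample rows and the sample sentence are assumptions for illustration. Also keep in mind that 'unicode_escape' treats any non-escape bytes as Latin-1, so this approach fits files that contain only ASCII escape sequences like the one in the question.

```python
import csv
import io

# Stand-in for replace.csv (assumed contents): the file holds literal
# backslash escapes, i.e. the six characters \ u 0 0 c 1, not 'Á'.
raw = b'\\u00c1,\\u0041\n\\u00f1,\\u006e\n'

# Read as plain text, as in the question: the first field is the
# six-character escape sequence, so str.replace never finds a match.
plain_rows = list(csv.reader(io.StringIO(raw.decode('utf-8'))))
print(len(plain_rows[0][0]))  # 6

# Decode the escapes line by line before the csv reader sees them.
decoded_rows = list(
    csv.reader(line.decode('unicode_escape') for line in io.BytesIO(raw))
)
print(decoded_rows)  # [['Á', 'A'], ['ñ', 'n']]

# Apply the replacements to a sample string.
text = 'Ávila, España'
for src, dst in decoded_rows:
    text = text.replace(src, dst)
print(text)  # Avila, Espana
```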
Upvotes: 1