Reputation: 1136
I have a CSV file that contains accented characters, including in country names. I'm using a csv reader with a specified encoding and dialect to parse it, but the accents come out garbled.
import csv
import re

p = re.compile('(?<=n).*?(?=,)')
with open('/file.csv', 'rt', encoding='cp1252') as csvFile:
    reader = csv.reader(csvFile, dialect='excel')
    next(csvFile)  # skip the header line
    for row in reader:
        print(row[0])
        accented_words = p.findall(row[8])[0].strip()
        print(accented_words)
Here p is a regex that pulls some accented characters out. It gives me results like 'C™te dÕIvoire'. How can I get past this and preserve the accented characters?
Upvotes: 2
Views: 268
Reputation: 414905
The correct way to parse a CSV file that uses the excel dialect in Python 3:
with open('/file.csv', newline='', encoding=correct_encoding) as file:
    reader = csv.reader(file)
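As a minimal, self-contained sketch of that pattern: the helper name read_rows, the sample data and the temporary file are assumptions made up for illustration, and UTF-8 only stands in for whatever the file's real encoding turns out to be.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Return all data rows of a CSV file decoded with the given encoding."""
    # newline='' lets the csv module handle line endings itself, which
    # matters for quoted fields that contain embedded newlines.
    with open(path, newline='', encoding=encoding) as f:
        reader = csv.reader(f)  # 'excel' is the default dialect
        next(reader, None)      # skip the header through the reader
        return list(reader)


# Hypothetical sample data, written as UTF-8 bytes to a temporary file.
sample = 'id,country\r\n1,"Côte d’Ivoire"\r\n2,Curaçao\r\n'
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(sample.encode('utf-8'))

rows = read_rows(tmp.name, 'utf-8')
os.remove(tmp.name)
print(rows)
```

The accents survive here only because the encoding passed to open() matches the bytes on disk; passing a different encoding for the same file would reproduce the kind of garbling described in the question.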
Your issue might be the incorrect input character encoding:
>>> print(u'Côte d’Ivoire'.encode('utf-8').decode('cp1252'))
CÃ´te dâ€™Ivoire
The example shows what happens when UTF-8 data is decoded as cp1252: each accented character turns into two or three unrelated characters.
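The same round-trip idea can be used to guess which encoding a file really uses. The sketch below is only an observation, not something the asker confirmed: Mac Roman bytes decoded as cp1252 happen to reproduce the exact string from the question. The helper name candidate_encodings and the list of candidates are hypothetical.

```python
original = 'Côte d’Ivoire'

# Mac Roman bytes decoded as cp1252 give the garbling quoted in the question.
garbled = original.encode('mac_roman').decode('cp1252')
print(garbled)


def candidate_encodings(garbled_text, wrong_encoding, candidates):
    """Return (encoding, text) pairs for candidates that decode the raw bytes."""
    # Recover the raw bytes by undoing the wrong decoding step.
    raw = garbled_text.encode(wrong_encoding)
    results = []
    for enc in candidates:
        try:
            results.append((enc, raw.decode(enc)))
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results


# Hypothetical shortlist of encodings to try.
guesses = candidate_encodings('C™te dÕIvoire', 'cp1252', ['utf-8', 'mac_roman'])
print(guesses)
```

If a candidate yields readable text, it is worth trying as the encoding argument to open(). This recovery only works when the wrong decoding did not lose information, and a candidate that decodes without error can still be wrong, so the result needs to be checked by eye.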
Upvotes: 1