BarFooBar

Reputation: 1136

How to parse text from an Excel document in Python 3?

I have a CSV file containing several accented characters, including in country names. I'm using csv.reader with a specified encoding and dialect to parse it, but the accents come out garbled.

import csv
import re

p = re.compile('(?<=n).*?(?=,)')
with open('/file.csv', 'rt', encoding='cp1252') as csvFile:
    reader = csv.reader(csvFile, dialect='excel')
    next(reader)  # skip the header row
    for row in reader:
        print(row[0])
        # first regex match in the ninth column, with surrounding whitespace removed
        accented_words = p.findall(row[8])[0].strip()
        print(accented_words)

p is a regex that pulls out the text containing the accented characters. It gives me results like 'C™te dÕIvoire'. How can I fix this and preserve the accented characters?

Upvotes: 2

Views: 268

Answers (1)

jfs

Reputation: 414905

The correct way to parse a csv file that uses excel dialect in Python 3:

with open('/file.csv', newline='', encoding=correct_encoding) as file:
    reader = csv.reader(file)
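
A fuller, runnable sketch of the same pattern, in case it helps. The helper name read_rows, the temporary file, and the sample row are hypothetical and only there for illustration; the encoding is passed in because the right value depends on how the file was actually saved.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Return the data rows of a CSV file, skipping the header row."""
    # newline='' lets the csv module handle line endings itself,
    # including newlines embedded in quoted fields.
    with open(path, newline='', encoding=encoding) as file:
        reader = csv.reader(file)  # dialect='excel' is the default
        next(reader, None)         # skip the header via the reader
        return list(reader)


# Demo: write a small file in a known encoding, then read it back.
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='',
                                 encoding='utf-8', delete=False) as tmp:
    csv.writer(tmp).writerows([['id', 'country'], ['1', 'Côte d’Ivoire']])

rows = read_rows(tmp.name, 'utf-8')
os.remove(tmp.name)
print(rows)
```

Reading with the same encoding the file was written in round-trips the accented characters unchanged; the garbling only appears when the two differ.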

Your issue might be the incorrect input character encoding:

>>> print(u'Côte d’Ivoire'.encode('utf-8').decode('cp1252'))
CÃ´te dâ€™Ivoire

The example shows what happens if utf-8 data is decoded as cp1252.
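
The garbled string in the question ('C™te dÕIvoire') does not match the UTF-8 pattern, so the file may be in yet another encoding. One way to narrow it down is to undo the wrong decode and try candidate encodings against a value you know should appear. The sketch below assumes you know the expected text; the function name guess_source_encoding and the candidate list are illustrative, not part of any library.

```python
def guess_source_encoding(garbled, wrong_encoding, expected, candidates):
    """Return the candidate encodings that turn `garbled` back into `expected`."""
    # Re-encoding with the wrong codec recovers the original bytes from the file.
    raw = garbled.encode(wrong_encoding)
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this candidate encoding.
            pass
    return matches


matches = guess_source_encoding(
    'C™te dÕIvoire',   # what the question's code printed
    'cp1252',          # the encoding the question's code used
    'Côte d’Ivoire',   # what the text should have been
    ['utf-8', 'latin-1', 'mac_roman'],
)
print(matches)  # ['mac_roman']
```

This suggests, though a single string cannot prove it, that the file was saved as Mac OS Roman, in which case encoding='mac_roman' in open() would be worth trying.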

Upvotes: 1
