Reputation: 8893
I have been working on ways to flatten text into ASCII, so that ā -> a, ñ -> n, etc. unidecode has been fantastic for this.
# -*- coding: utf-8 -*-
from unidecode import unidecode
print(unidecode(u"ā, ī, ū, ś, ñ"))
print(unidecode(u"Estado de São Paulo"))
Produces:
a, i, u, s, n
Estado de Sao Paulo
However, I can't duplicate this result with data from an input file.
Content of test.txt file:
ā, ī, ū, ś, ñ
Estado de São Paulo
# -*- coding: utf-8 -*-
from unidecode import unidecode
with open("test.txt", 'r') as inf:
    for line in inf:
        print unidecode(line.strip())
Produces:
A, A<<, A<<, A, A+-
Estado de SAPSo Paulo
And:
RuntimeWarning: Argument is not an unicode object. Passing an encoded string will likely have unexpected results.
Question: How can I read these lines in as unicode so that I can pass them to unidecode?
Upvotes: 12
Views: 7314
Reputation: 308346
Use codecs.open, which decodes the file to unicode as it is read:
import codecs
with codecs.open("test.txt", 'r', 'utf-8') as inf:
Edit: The above was for Python 2.x. For Python 3 you don't need to use codecs; the encoding parameter has been added to the regular open.
with open("test.txt", 'r', encoding='utf-8') as inf:
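As a minimal sketch of the Python 3 form above, the following writes a small UTF-8 sample file and reads it back as text. The helper name read_unicode_lines and the temporary sample file are assumptions made for illustration; they are not part of the original question.

```python
import os
import tempfile


def read_unicode_lines(path, encoding="utf-8"):
    """Return the stripped lines of `path`, decoded to text with `encoding`."""
    # Passing encoding= makes open() decode the bytes while reading,
    # so every line comes back as a text (str) object.
    with open(path, "r", encoding=encoding) as inf:
        return [line.strip() for line in inf]


# Build a small UTF-8 sample file so the sketch is self-contained.
sample = "ā, ī, ū, ś, ñ\nEstado de São Paulo\n"
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(sample.encode("utf-8"))

lines = read_unicode_lines(sample_path)
os.remove(sample_path)
```

Each element of lines is already decoded text, so it can be passed straight to unidecode(line) without triggering the RuntimeWarning from the question.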
Upvotes: 8
Reputation: 281330
import codecs
with codecs.open('test.txt', encoding='whicheveronethefilewasencodedwith') as f:
    ...
The codecs module provides a function to open files with automatic Unicode encoding/decoding, among other things.
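To illustrate why the encoding argument has to match the file, here is a small sketch that assumes a hypothetical copy of the text saved as Latin-1 rather than UTF-8. The temporary file and variable names are made up for the example.

```python
import codecs
import os
import tempfile

# Hypothetical sample: one line of the text saved as Latin-1 instead of UTF-8.
fd, latin1_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write("Estado de São Paulo\n".encode("latin-1"))

# Decoding with the codec the file was written in recovers the text.
with codecs.open(latin1_path, "r", encoding="latin-1") as f:
    decoded = f.read().strip()

# Decoding with the wrong codec fails here, because the Latin-1 byte
# for "ã" followed by "o" is not a valid UTF-8 sequence.
try:
    with codecs.open(latin1_path, "r", encoding="utf-8") as f:
        f.read()
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

os.remove(latin1_path)
```

A wrong codec does not always raise an error; it can also decode silently into the wrong characters, so it is worth checking how the file was actually saved.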
Upvotes: 5