Reputation: 3611
I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)
I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.
I replaced csv with unicodecsv but I get the same error. Here's a simplified version of my code:
from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL

data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,[email protected]\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,[email protected]\r'
)
unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)

class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True

rows = DictReader(unicode_data, dialect=CustomDialect)
for row in rows:
    print row
If I replace StringIO with BytesIO, the encoding works, but I can no longer pass the newline argument, and then I get:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks
Upvotes: 1
Views: 750
Reputation: 3611
I opened an issue on the unicodecsv GitHub page and it turns out (a bit counterintuitively, IMO) that the unicodecsv reader expects a bytestring, not a unicode object.
After taking some time to get this whole thing with Unicode and encodings clearer in my head, I realized I didn't really need unicodecsv in the first place. The initial problem was that io.StringIO, when iterated with .next(), was returning unicode objects to csv.DictReader, which expected bytestrings. Since unicodecsv also expects bytestrings, it obviously couldn't solve the problem.
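To make the type mismatch concrete, here is a minimal sketch (the names raw, stream, line and encoded are only illustrative, and the sample row is shortened from the question). It shows that iterating io.StringIO yields decoded text with the bare \r already translated to \n, and that an explicit encode is what turns it back into the UTF-8 bytestring that Python 2's csv module works with:

```python
import io

# UTF-8 encoded sample row terminated by a bare carriage return.
raw = b'Jo\xc3\xa3o,Ara\xc3\xbajo\r'

# newline=None turns on universal-newline handling, so '\r' becomes '\n'.
stream = io.StringIO(raw.decode('utf-8'), newline=None)

# Iterating a StringIO yields text (unicode in Python 2), not bytes.
line = next(stream)

# Encoding the line gives a UTF-8 bytestring again.
encoded = line.encode('utf-8')
```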
My solution was changing the file-like object I was passing to the csv.DictReader so that it returned properly encoded bytestrings instead of unicode objects:
class UTF8EncodedStringIO(StringIO):
    def next(self):
        return super(UTF8EncodedStringIO, self).next().encode('utf-8')

udata = UTF8EncodedStringIO(unicode(data, 'utf-8-sig'), newline=None)
By writing this simple wrapper around StringIO instead of using BytesIO, I could solve the encoding problems and still benefit from the newline argument. There's a bit of decoding/encoding overhead, but I was out of alternatives. If somebody has a better suggestion, feel free to share.
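One possible variation on the same idea, offered only as a sketch: instead of subclassing StringIO, a small generator can encode each line on the fly, since csv.DictReader accepts any iterator of lines. The helper names utf8_lines and unicode_dict_rows are hypothetical, the PY2 switch is an assumption added so the snippet also runs on Python 3 (where csv takes text directly), and the sample data drops the email column from the question. It also assumes every row has a value for every column.

```python
import csv
import io
import sys

PY2 = sys.version_info[0] == 2

def utf8_lines(text_stream):
    """Yield each line as UTF-8 bytes on Python 2, unchanged text on Python 3."""
    for line in text_stream:
        yield line.encode('utf-8') if PY2 else line

def unicode_dict_rows(text_stream, **reader_kwargs):
    """Yield one dict per CSV row with text keys and values.

    Assumes no missing fields (a None value would not decode on Python 2).
    """
    reader = csv.DictReader(utf8_lines(text_stream), **reader_kwargs)
    for row in reader:
        if PY2:
            # Python 2's csv returns bytestrings; decode them back to unicode.
            yield dict((k.decode('utf-8'), v.decode('utf-8'))
                       for k, v in row.items())
        else:
            yield dict(row)

raw = (
    b'first_name,last_name\r'
    b'Elmer,Fudd\r'
    b'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo\r'
)
# newline=None keeps the universal-newline handling from the original approach.
stream = io.StringIO(raw.decode('utf-8-sig'), newline=None)
rows = list(unicode_dict_rows(stream))
```

This has the same per-line encode/decode overhead as the subclass; the only difference is that the StringIO object itself is left untouched.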
Upvotes: 1