Ariel


unicodecsv.DictReader not working with io.StringIO (Python 2.7)

I was trying to use csv.DictReader to parse UTF-8 data with special characters but I was getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe3' in position 2: ordinal not in range(128)

I read online and found out that Python 2.7's csv library doesn't handle Unicode. I looked for an alternative library and found unicodecsv.

I replaced csv with unicodecsv but I get the same error. Here's a simplified version of my code:

from io import StringIO
from unicodecsv import DictReader, Dialect, QUOTE_MINIMAL

data = (
    'first_name,last_name,email\r'
    'Elmer,Fudd,[email protected]\r'
    'Jo\xc3\xa3o Ant\xc3\xb4nio,Ara\xc3\xbajo,[email protected]\r'
)

unicode_data = StringIO(unicode(data, 'utf-8-sig'), newline=None)

class CustomDialect(Dialect):
    delimiter = ','
    doublequote = True
    escapechar = '\\'
    lineterminator = '\r\n'
    quotechar = '"'
    quoting = QUOTE_MINIMAL
    skipinitialspace = True

rows = DictReader(unicode_data, dialect=CustomDialect)

for row in rows:
    print row

If I replace StringIO with BytesIO, the encoding works, but I can no longer pass the newline argument, and I get:

Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?

Does anybody have any idea how I could solve this? Shouldn't unicodecsv be handling StringIO? Thanks


Answers (1)

Ariel


I opened an issue on the unicodecsv GitHub page, and it turns out (a bit counterintuitively, imo) that the unicodecsv reader expects bytestrings, not unicode objects.

After taking some time to get Unicode and encodings clearer in my head, I realized I didn't really need unicodecsv in the first place. The initial problem was that io.StringIO, when iterated with .next(), returned unicode objects to csv.DictReader, which expected bytestrings. Since unicodecsv also expects bytestrings, it obviously can't solve that problem.
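
The root cause can be reproduced without csv at all. The sketch below uses a made-up two-line sample (not the data from the question) and a hypothetical helper named encode_error. It shows that io.StringIO hands back text lines with the bare '\r' terminators already translated, and that pushing such a line through the ASCII codec, which is what Python 2's csv appears to do implicitly judging by the traceback, fails at the same position as the original error.

```python
import io

# Hypothetical sample, not the question's data; u'' literals work on 2.7 and 3.3+.
stream = io.StringIO(u'first_name,last_name\rJo\xe3o,Ara\xfajo\r', newline=None)

# newline=None translates the bare '\r' terminators to '\n'; iteration yields text lines.
lines = list(stream)


def encode_error(line, codec):
    """Return the UnicodeEncodeError raised by line.encode(codec), or None."""
    try:
        line.encode(codec)
    except UnicodeEncodeError as exc:
        return exc
    return None


ascii_error = encode_error(lines[1], 'ascii')  # fails on u'\xe3' at position 2
utf8_error = encode_error(lines[1], 'utf-8')   # None: UTF-8 can represent the line
```

Encoding each line to UTF-8 before csv sees it avoids the implicit ASCII step, which is what the fix below does.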

My solution was changing the file-like object I was passing to the csv.DictReader so that it returned properly encoded bytestrings instead of unicode objects:

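from io import StringIO

class UTF8EncodedStringIO(StringIO):
    # csv.DictReader (Python 2) expects bytestrings, so encode each line as it is read
    def next(self):
        return super(UTF8EncodedStringIO, self).next().encode('utf-8')

# newline=None keeps universal-newline handling, so the bare '\r' terminators still split lines
udata = UTF8EncodedStringIO(unicode(data, 'utf-8-sig'), newline=None)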

By writing this simple wrapper around StringIO instead of using BytesIO, I could solve the encoding problems and still benefit from the newline argument. There's a bit of decoding/encoding overhead, but I was out of alternatives. If somebody has a better suggestion, feel free to share.
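
One possible variation, offered only as a sketch: the same per-line encoding can be done with a small generator instead of a StringIO subclass. The name utf8_lines and the sample bytes are hypothetical, not part of the original code.

```python
import io


def utf8_lines(text_stream):
    """Yield each line of a text stream as a UTF-8 encoded bytestring."""
    for line in text_stream:
        yield line.encode('utf-8')


# Hypothetical sample bytes with a UTF-8 BOM and bare '\r' line endings.
raw = b'\xef\xbb\xbffirst_name,last_name\rJo\xc3\xa3o,Ara\xc3\xbajo\r'

# 'utf-8-sig' strips the BOM; newline=None translates '\r' to '\n'.
text = io.StringIO(raw.decode('utf-8-sig'), newline=None)
encoded = list(utf8_lines(text))
```

On Python 2.7 the generator could be handed straight to the reader, e.g. csv.DictReader(utf8_lines(text)), since csv only needs an iterable that yields one line per step. The values in each row would still be UTF-8 bytestrings, so they need .decode('utf-8') if unicode objects are wanted.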

