Ugur

Reputation: 2044

Processing a csv file with utf-8 text in it

I have a csv file (see [1] below) that has non-ASCII text in it (for instance a name like Antonio Melé). The file has a list of books with URLs, excerpts and comments.

In Python 3.5 I open and process the file like so:

# -*- coding: utf-8 -*-
import codecs
import csv 
import pdb


def select_book_matching_keyword(books, kw):
    """
    Will select the csv rows for which any column has matching keyword in it

    Snippet from csv file:
    `Django By Example,Antonio Melé,Using class-based ...`

        `Antonio Melé`  
           becomes  
        `b'Antonio Mel\xc3\xa9'`
    """
    selected_books = []
    for book in books:
        kw_in_any_column = [column for column in book if kw in column.decode()]
        # >> Without the `column.decode()` above I cannot
        #    run this list comprehension (that is, if I
        #    write `if kw in column` instead of `if kw in column.decode()`)
        if kw_in_any_column:
            # print(book)
            selected_books.append(book) 
    return selected_books


if __name__=='__main__':
    f = codecs.open('safari-annotations-export-3.csv', 'r', 'utf-8')
    reader = csv.reader(f)
    books = []

    for row in reader:
        book_utf8 = [column.encode("utf-8") for column in row]
        books.append(book_utf8)
        print(book_utf8)

    pdb.set_trace()

Now printing the rows of the csv (see print(book_utf8) above) will give me results like:

[b'Django By Example', b'Antonio Mel\xc3\xa9', b'Using class-based views', b'2017-03-08', b'https://www.safaribooksonline.com/library/view/django-by-example/9781784391911/', b'https://www.safaribooksonline.com/library/view/django-by-example/9781784391911/ch01s09.html', b'https://www.safaribooksonline.com/a/django-by-example/5869158/', b'Using class-based views', b'']

First off, I have a byte prefix on every value. Why? (Python 3.x treats strings as Unicode by default, while Python 2.7 treats them as bytes by default.)

And then I have this: b'Antonio Mel\xc3\xa9' instead of Antonio Melé.

I know that I have not fully grasped the concept of encoding in Python. I've read many posts here on SO, but still I don't really get it.


[1] csv file with utf-8 text

[2] Trying to print a list of rows of the csv file without encoding the columns of each row gives me an error:

(snip) ['Learning jQuery Deferreds', 'Terry Jones...', '2. The jQuery Deferred API', '2017-04-06', 'https://www.safaribooksonline.com/library/view/learning-jquery-deferreds/9781449369385/', 'https://www.safaribooksonline.com/library/view/learning-jquery-deferreds/9781449369385/ch02.html', 'https://www.safaribooksonline.com/a/learning-jquery-deferreds/6635517/', 'More Terminology: Resolve, Reject and Progress', ''] *** UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 368: ordinal not in range(128)

Upvotes: 3

Views: 2226

Answers (1)

lenz

Reputation: 5817

Typically, all encoding/decoding is done when communicating with the outer world. In your example, there are two communication steps:

  • you read from a file opened with codecs.open(),
  • you write out the result using the print() built-in.

In between, you should always work with decoded strings, i.e. type str (Python 2's unicode).
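For reference, this is the difference at the type level (a plain REPL check, nothing specific to your file):

>>> type('Antonio Melé'), type('Antonio Melé'.encode('utf-8'))
(<class 'str'>, <class 'bytes'>)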

Reading from an on-disk file

The first point goes well, initially: you open the file with the correct encoding and let csv do the format parsing. This makes sure that the bytes found on disk are correctly decoded into strings, without you having to call a decode method yourself. (As a side note, you could omit codecs here and just use the built-in open(filename, 'r', encoding='utf-8'), which effectively does the same thing.)
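For illustration, the whole reading step could look roughly like this with the built-in open (the file name is taken from your snippet, newline='' is what the csv docs recommend):

import csv

with open('safari-annotations-export-3.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    books = list(reader)   # each row is a list of str, already decoded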

But then, you re-encode the strings with the following line:

book_utf8 = [column.encode("utf-8") for column in row]

You shouldn't do this. Now you have to process bytes instead of strings. Note:

>>> 'Antonio Melé'.encode('utf-8')
b'Antonio Mel\xc3\xa9'

The bytes type shares some features with str, but the two are incompatible. That's why you have to decode each element in the select_book_matching_keyword function (which is never actually called in your snippet, by the way), so that the membership test compares strings with strings, not strings with bytes.
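Once the rows stay as str, the keyword filter needs no decode at all; roughly like this (same function name as in your snippet):

def select_book_matching_keyword(books, kw):
    """Select the csv rows where any column contains the keyword."""
    selected_books = []
    for book in books:                          # book is a list of str
        if any(kw in column for column in book):
            selected_books.append(book)
    return selected_books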

One of the differences between the two types is that print() uses the repr form to display bytes, thus the output will include quotes and the b prefix:

>>> print(b'Antonio Mel\xc3\xa9')
b'Antonio Mel\xc3\xa9'

Compare to printing strings:

>>> print('Antonio Melé')
Antonio Melé
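And decoding is the way back, if you do end up with bytes at some point:

>>> b'Antonio Mel\xc3\xa9'.decode('utf-8')
'Antonio Melé'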

Writing text or data to STDOUT

This brings us to the next problem: Writing data to STDOUT using print(). If you try the above line, you probably get an exception:

>>> print('Antonio Melé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 11: ordinal not in range(128)

The problem is that, apparently, 'ascii' encoding is used. Now, how do you specify the encoding? It's clear when using open to write to a file on disk:

f = open(filename, 'w', encoding='utf8')
f.write('Antonio Melé')
f.close()

But you can't tell print what encoding to use. The reason is that it uses a file handle that is already open, i.e. sys.stdout. In my case, this is:

>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>

but you probably see encoding='ascii' or something like 'ANSI_X3.4-1968'.

You have two possibilities:

  • You write the output to a disk file, and don't use print at all.
  • You change the encoding of sys.stdout.
    (More precisely, you replace it with a new TextIOWrapper around the underlying bytes-based STDOUT stream.)

The first possibility is obvious, I hope; there is a small sketch of it at the end of this answer. For the second one, you need one additional line of code (provided that sys is imported):

sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)

Now print will encode strings with UTF-8.
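Put together, a minimal self-contained version of that second option:

import sys
import codecs

sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
print('Antonio Melé')   # encoded as UTF-8 on the way out, no UnicodeEncodeError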

However, you might still have trouble: it's quite likely that your terminal isn't configured to accept and properly display UTF-8 text, or that it doesn't support Unicode at all. In that case you either get garbled characters on screen, or maybe yet another exception. But that problem lies outside Python; you'll have to fix it through the terminal configuration, or by switching to a different terminal.
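For completeness, a small sketch of the first possibility, writing the selected rows to a disk file instead of printing them (the keyword and the output file name here are just placeholders):

import csv

selected = select_book_matching_keyword(books, 'Django')   # rows as lists of str
with open('selected-books.csv', 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out)
    writer.writerows(selected)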

Upvotes: 3
