Reputation: 2044
I have a csv file (see [1] below) that has non-ASCII text in it (for instance a name like Antonio Melé). The file has a list of books with URLs, excerpts and comments.
In Python 3.5 I open and process the file like so:
```python
# -*- coding: utf-8 -*-
import codecs
import csv
import pdb


def select_book_matching_keyword(books, kw):
    """
    Will select the csv rows for which any column has matching keyword in it

    Snippet from csv file:
    `Django By Example,Antonio Melé,Using class-based ...`

    `Antonio Melé` becomes `b'Antonio Mel\xc3\xa9'`
    """
    selected_books = []
    for book in books:
        kw_in_any_column = [column for column in book if kw in column.decode()]
        # >> Without the `column.decode()` above I cannot
        #    run this list comprehension (that is, if I
        #    write `if kw in column` instead of `if kw in column.decode()`)
        if kw_in_any_column:
            # print(book)
            selected_books.append(book)
    return selected_books


if __name__ == '__main__':
    f = codecs.open('safari-annotations-export-3.csv', 'r', 'utf-8')
    reader = csv.reader(f)
    books = []
    for row in reader:
        book_utf8 = [column.encode("utf-8") for column in row]
        books.append(book_utf8)
        print(book_utf8)
    pdb.set_trace()
```
Now printing the rows of the csv (see `print(book_utf8)` above) will give me results like:

```
[b'Django By Example', b'Antonio Mel\xc3\xa9', b'Using class-based views', b'2017-03-08', b'https://www.safaribooksonline.com/library/view/django-by-example/9781784391911/', b'https://www.safaribooksonline.com/library/view/django-by-example/9781784391911/ch01s09.html', b'https://www.safaribooksonline.com/a/django-by-example/5869158/', b'Using class-based views', b'']
```
First off, I have a byte prefix. Why? (Python 3.x treats strings as Unicode by default, Python 2.7 treats them as bytes by default.)
And then I have `b'Antonio Mel\xc3\xa9'` instead of `Antonio Melé`.
I know that I have not fully grasped the concept of encoding in Python. I have read many posts here on SO, but I still don't really get it. Shouldn't opening the file with the right encoding, `utf-8`, fix this? I did that.[2] Trying to print a list of rows of the csv file without encoding the columns of the row gives me an error:
```
(snip)
['Learning jQuery Deferreds', 'Terry Jones...', '2. The jQuery Deferred API', '2017-04-06', 'https://www.safaribooksonline.com/library/view/learning-jquery-deferreds/9781449369385/', 'https://www.safaribooksonline.com/library/view/learning-jquery-deferreds/9781449369385/ch02.html', 'https://www.safaribooksonline.com/a/learning-jquery-deferreds/6635517/', 'More Terminology: Resolve, Reject and Progress', '']
*** UnicodeEncodeError: 'ascii' codec can't encode character '\u2019' in position 368: ordinal not in range(128)
```
Upvotes: 3
Views: 2226
Reputation: 5817
Typically, all encoding/decoding is done when communicating with the outside world. In your example, there are two communication steps: reading the file via `codecs.open()`, and writing to STDOUT via the `print()` built-in. Between those two, you should always work with decoded strings, i.e. type `str` (Python 2's `unicode`).
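A minimal sketch of that boundary principle (the sample bytes stand in for whatever the file actually contains):

```python
# Bytes exist only at the I/O boundaries; everything in between is str.
raw = b'Antonio Mel\xc3\xa9'     # what the disk actually stores (UTF-8 bytes)

name = raw.decode('utf-8')       # input boundary: bytes -> str
assert isinstance(name, str)

upper = name.upper()             # all processing happens on str

encoded = upper.encode('utf-8')  # output boundary: str -> bytes
assert isinstance(encoded, bytes)
```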
The first point goes well, initially: you open the file with the correct encoding and let `csv` do the format parsing. This makes sure that the bytes found on the disk are correctly decoded into strings, without you having to call a `decode` method yourself.
(As a side note, you can omit `codecs` here and just use the built-in `open(filename, 'r', encoding='utf-8')`, which effectively does the same thing.)
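For example, the whole reading step could look like this (the tiny sample file written here stands in for the real `safari-annotations-export-3.csv`):

```python
import csv

# Write a one-row sample file so the sketch is self-contained.
with open('books.csv', 'w', encoding='utf-8', newline='') as f:
    f.write('Django By Example,Antonio Melé,Using class-based views\n')

# newline='' is what the csv module's documentation recommends for csv files.
with open('books.csv', 'r', encoding='utf-8', newline='') as f:
    books = [row for row in csv.reader(f)]

# Every column is already str; no decode() is needed anywhere.
```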
But then, you re-encode the strings with the following line:

```python
book_utf8 = [column.encode("utf-8") for column in row]
```

You shouldn't do this. Now you have to process `bytes` instead of strings.
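A sketch of the fix, keeping the question's structure but simply dropping the `.encode()` (the sample file written here stands in for the real CSV):

```python
import codecs
import csv

# Hypothetical sample data standing in for the real file.
with open('sample.csv', 'w', encoding='utf-8') as f:
    f.write('Django By Example,Antonio Melé,Using class-based views\n')

f = codecs.open('sample.csv', 'r', 'utf-8')
reader = csv.reader(f)
books = []
for row in reader:
    books.append(row)  # keep the str columns as-is; no .encode()
f.close()
```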
Note:

```python
>>> 'Antonio Melé'.encode('utf-8')
b'Antonio Mel\xc3\xa9'
```

The `bytes` type has common features with strings, but the two are incompatible.
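For instance, a membership test between the two types fails outright:

```python
data = b'Antonio Mel\xc3\xa9'   # bytes

try:
    'Melé' in data              # str needle, bytes haystack
except TypeError as exc:
    print(exc)                  # a bytes-like object is required, not 'str'
```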
That's why you have to `decode` each element in the `select_book_matching_keyword` function (which is not used in your code snippet, by the way), so that the membership test is done between strings and strings, not strings and bytes.
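With the rows kept as `str`, the same function works without any decode calls. A simplified sketch (using `any()` instead of the original list comprehension, with made-up sample data):

```python
def select_book_matching_keyword(books, kw):
    """Select the rows in which any column contains the keyword."""
    selected_books = []
    for book in books:
        if any(kw in column for column in book):  # str in str: no decode()
            selected_books.append(book)
    return selected_books

books = [['Django By Example', 'Antonio Melé', 'Using class-based views'],
         ['Learning jQuery Deferreds', 'Terry Jones', 'The jQuery Deferred API']]
```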
One of the differences between the two types is that `print()` uses the `repr` form to display `bytes`, thus the output will include quotes and the `b` prefix:

```python
>>> print(b'Antonio Mel\xc3\xa9')
b'Antonio Mel\xc3\xa9'
```

Compare to printing strings:

```python
>>> print('Antonio Melé')
Antonio Melé
```
This brings us to the next problem: writing data to STDOUT using `print()`. If you try the above line, you probably get an exception:

```python
>>> print('Antonio Melé')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 11: ordinal not in range(128)
```

The problem is that, apparently, the `'ascii'` encoding is used.
Now, how do you specify the encoding? It's clear when using `open` to write to a file on disk:

```python
f = open(filename, 'w', encoding='utf8')
f.write('Antonio Melé')
f.close()
```
But you can't tell `print` what encoding to use. The reason is that it uses a file handle that is already open, i.e. `sys.stdout`. In my case, this is:

```python
>>> sys.stdout
<_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
```

but you probably see `encoding='ascii'` or something like `'ANSI_X3.4-1968'`.
You have two possibilities:

1. Encode the strings manually and don't use `print` at all.
2. Replace `sys.stdout`.

The first possibility is obvious, I hope. For the second one, you need one additional line of code (provided that `sys` is imported):
```python
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
```

Now `print` will encode strings with UTF-8.
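The first possibility, for comparison, would look like this: encode the string yourself and write the bytes straight to the underlying binary buffer, bypassing `print()` and its encoding step entirely.

```python
import sys

text = 'Antonio Melé\n'
encoded = text.encode('utf-8')       # encode manually, once, at the boundary
sys.stdout.buffer.write(encoded)     # write raw bytes to the binary layer
sys.stdout.buffer.flush()
```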
However, you might still have trouble: It's quite likely that your terminal isn't configured to accept and properly display UTF-8 text, or that it doesn't even support Unicode. If that is the case, you either get garbled characters on screen, or maybe another exception. But that problem is outside Python, you'll have to fix it through the terminal config, or by switching to a different one.
Upvotes: 3