Chris J. Vargo

Reputation: 2446

Converting ASCII output to UTF-8

I'm really close to having a script that fetches JSON from the New York Times API and then converts it to CSV. However, occasionally I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 21: ordinal not in range(128)

I think I could avoid this altogether if I converted the output to UTF-8, but I am unsure how to do so. Here is my Python script:

import urllib2
import json
import csv

outfile_path='/NYTComments.csv'

writer = csv.writer(open(outfile_path, 'w'))

url = urllib2.Request('http://api.nytimes.com/svc/community/v2/comments/recent?api-key=ea7aac6c5d0723d7f1e06c8035d27305:5:66594855')

parsed_json = json.load(urllib2.urlopen(url))

print parsed_json

for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(str(comment['commentBody']))
    row.append(str(comment['commentTitle']))
    row.append(str(comment['approveDate']))
    writer.writerow(row)

Upvotes: 1

Views: 3129

Answers (2)

David S

Reputation: 13911

A few things...

  • I don't know anything about the New York Times API (I've never used it), but I would guess you probably shouldn't publish a code snippet that contains your "api-key".

  • If you look, the API tells you the encoding. You are getting the following back in the response header:

    Content-Type=application/json; charset=UTF-8 
    
  • Googling "python and UnicodeEncodeError" will turn up a lot of help. Here, though, the problem is most likely the calls to str() on the comment fields. Those values are unicode objects, and in Python 2 str() encodes them with the 'ascii' codec. Any character outside the ASCII range (code point 128 or above, such as u'\u201c') then raises the error you are seeing. Here is a pretty good blog post on the topic. It might help you to read over it.
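To make the last two bullets concrete, here is a small sketch. It is written for Python 3, which is an assumption (the question uses Python 2). In Python 3 the implicit ASCII encoding that Python 2's str() performs on a unicode object has to be spelled out as an explicit .encode('ascii') call. The header string is the one quoted above, and the helper name charset_from_content_type is made up for illustration.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value (lowercased), or None."""
    msg = Message()
    msg['Content-Type'] = header_value
    return msg.get_content_charset()


# LEFT DOUBLE QUOTATION MARK, the character named in the traceback.
quote = '\u201c'

# Encoding to ASCII fails because the code point is outside 0..127.
try:
    quote.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent the same character as three bytes.
utf8_bytes = quote.encode('utf-8')

# The header value quoted in the bullet above.
charset = charset_from_content_type('application/json; charset=UTF-8')
```

The point of the sketch is that the text itself is fine. Only the choice of codec at the moment the text is turned into bytes decides whether the error appears.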

Edit: This solution works for me:

for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    row.append(comment['commentBody'].encode('UTF-8', 'replace'))
    row.append(comment['commentTitle'].encode('UTF-8', 'replace'))
    row.append(str(comment['approveDate']))
    writer.writerow(row)
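For anyone on Python 3, the same loop might look roughly like the sketch below. This is an assumption about a port, not something tested against the real API: the function name write_comments and the sample dictionary are made up, and only the four field names come from the question. In Python 3 the csv module works with text, so no per-field .encode() call is needed. The encoding is chosen once, when the output file is opened.

```python
import csv
import io


def write_comments(parsed_json, out):
    """Write one CSV row per comment to the text stream `out`."""
    writer = csv.writer(out)
    for comment in parsed_json['results']['comments']:
        writer.writerow([
            comment['commentSequence'],
            comment['commentBody'],
            comment['commentTitle'],
            comment['approveDate'],
        ])


# Made-up sample data shaped like the response used in the question.
sample = {'results': {'comments': [{
    'commentSequence': 1,
    'commentBody': '\u201cHello\u201d',
    'commentTitle': 'n/a',
    'approveDate': '1350000000',
}]}}

# An in-memory text buffer stands in for the output file here.
buffer = io.StringIO(newline='')
write_comments(sample, buffer)
csv_text = buffer.getvalue()

# These are the bytes a file opened with encoding='utf-8' would contain.
csv_bytes = csv_text.encode('utf-8')
```

To write a real file, pass a handle opened with open(outfile_path, 'w', encoding='utf-8', newline='') as the out argument.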

Upvotes: 1

Nathan Villaescusa

Reputation: 17659

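Replace the second and third calls to str() with unicode(), then encode the result as UTF-8. Python 2's csv module does not accept unicode objects, so without the explicit encode the writer would still fall back to the 'ascii' codec.

for comment in parsed_json['results']['comments']:
    row = []
    row.append(str(comment['commentSequence']))
    # Encode explicitly so csv.writer receives UTF-8 byte strings.
    row.append(unicode(comment['commentBody']).encode('utf-8'))
    row.append(unicode(comment['commentTitle']).encode('utf-8'))
    row.append(str(comment['approveDate']))
    writer.writerow(row)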
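If the per-field conversions get repetitive, one option is to move them into a single helper. The sketch below is only an illustration: the name csv_cell is hypothetical, and it assumes non-text values (such as the sequence number) are plain ASCII when converted. It returns UTF-8 bytes on Python 2, where csv.writer expects byte strings, and plain text on Python 3, where the file's encoding handles the conversion.

```python
import sys

# unicode on Python 2, str on Python 3.
TEXT_TYPE = type(u'')


def csv_cell(value):
    """Return `value` in a form the running interpreter's csv.writer accepts."""
    # Convert non-text values (for example integers) to text first.
    text = value if isinstance(value, TEXT_TYPE) else TEXT_TYPE(value)
    if sys.version_info[0] == 2:
        # Python 2's csv module wants byte strings, so encode explicitly.
        return text.encode('utf-8')
    return text
```

With that in place, each row could be built as [csv_cell(comment[key]) for key in ('commentSequence', 'commentBody', 'commentTitle', 'approveDate')].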

Upvotes: 0
