Reputation: 13796
I'm having a problem that I believe has a simple solution.
I'm writing a Python script which reads a JSON string from a URL and parses it. To do this I'm using urllib2 and simplejson.
The problem has to do with encoding. The URL I'm reading from does not explicitly state its encoding (as far as I can tell), and it returns some Icelandic characters. I cannot give out the URL I'm reading from here, but I've set up a sample JSON data file on my own server, and I have the same problem reading that. Here is the file: http://haukurhaf.net/json.txt
This is my code:
#!/usr/bin/env python
# coding: utf-8
import urllib2, re, os
from BeautifulSoup import BeautifulSoup
import simplejson as json
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'
def fetchPage(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html
html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)
The JSON parser crashes with this error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte
Since I do not have any control over the server which holds the JSON data, I cannot control which encoding headers it sends out. I'm hoping I can solve this on my end somehow.
Any ideas?
Upvotes: 3
Views: 2732
Reputation: 3443
The file is encoded using Latin-1, not UTF-8, so you have to specify the encoding:
jsonData = json.JSONDecoder('latin1').decode(html)
BTW: html is a bad name for a JSON document...
Upvotes: 2
Reputation: 31299
You need to decode the byte string to unicode first (it's Latin-1 encoded right now):
uhtml = html.decode("latin-1")
jdata = json.loads(uhtml)
Or, if simplejson doesn't have loads:
json.JSONDecoder().decode(uhtml)
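The decode-then-parse approach above can be sketched as follows. This is only an illustration under a few assumptions: it targets Python 3 with the standard-library json module rather than urllib2 and simplejson, the sample bytes are made up (they are not the contents of the file in the question), and parse_latin1_json is a hypothetical helper name.

```python
import json

# Made-up sample payload: Latin-1 encoded bytes containing Icelandic characters.
raw = b'{"city": "Reykjav\xedk", "word": "h\xe1sk\xf3li"}'


def parse_latin1_json(data):
    """Decode the bytes as Latin-1 first, then parse the resulting text as JSON."""
    text = data.decode("latin-1")
    return json.loads(text)


# Treating the same bytes as UTF-8 fails, which mirrors the error in the question.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

parsed = parse_latin1_json(raw)
```

The key point is that the encoding is chosen before the JSON parser sees the data, so the parser only ever handles text.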
Upvotes: -1
Reputation: 536339
This resource is encoded as ISO-8859-1, or, more likely, the Windows variant code page 1252. It is not UTF-8.
You can read it with response.read().decode('cp1252') to get a Unicode string which [simple]json should also be able to parse.
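Since the server does not declare its encoding, one possible way to handle it on the client is to try a short list of candidate encodings in order. The sketch below assumes Python 3; decode_body and its candidate list are hypothetical, and the sample bytes are invented. It also shows where cp1252 and Latin-1 differ: they agree on bytes such as 0xe1, but cp1252 maps 0x80 to the euro sign while Latin-1 maps it to a control character.

```python
import json


def decode_body(data, candidates=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: Latin-1 accepts every byte, so it acts as a last resort.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")


# Invented sample: 0x80 is the euro sign in cp1252 and is not valid UTF-8 here.
raw = b'{"price": "5 \x80", "word": "h\xe1"}'
text, used = decode_body(raw)
data = json.loads(text)
```

This is a guess-based fallback, not real detection: a byte sequence can decode without error under the wrong encoding, so trusting a declared charset is preferable whenever one is available.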
However, in byte form, JSON must be encoded in one of the Unicode transformation formats (UTF-8, UTF-16 or UTF-32). Therefore this is not valid JSON, and it will fail if you attempt to load it from a browser too.
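To illustrate the point about byte-form JSON, here is a small sketch assuming Python 3.6 or later, where the standard-library json.loads accepts bytes and expects UTF-8, UTF-16 or UTF-32. This is not simplejson on Python 2, and the document used here is invented.

```python
import json

document = '{"word": "h\u00e1sk\u00f3li"}'

# Bytes in a UTF are accepted directly by json.loads.
from_utf8 = json.loads(document.encode("utf-8"))
from_utf16 = json.loads(document.encode("utf-16"))

# The same document as Latin-1 bytes is rejected, because the bytes
# are not valid in any of the encodings the parser tries.
try:
    json.loads(document.encode("latin-1"))
    latin1_accepted = True
except UnicodeDecodeError:
    latin1_accepted = False
```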
Upvotes: 1