HaukurHaf

Reputation: 13796

Problems parsing a JSON which is read from a URL

I'm having a problem that I believe has a simple solution.

I'm writing a Python script which reads a JSON string from a URL and parses it. To do this I'm using urllib2 and simplejson.

The problem has to do with encoding. As far as I can tell, the URL I'm reading from does not declare its encoding, and the response contains some Icelandic characters. I cannot share the real URL here, but I've set up a sample JSON file on my own server which fails in the same way. Here is the file: http://haukurhaf.net/json.txt

This is my code:

#!/usr/bin/env python
# coding: utf-8
import urllib2, re, os
from BeautifulSoup import BeautifulSoup
import simplejson as json

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3'

def fetchPage(url):
    req = urllib2.Request(url)
    req.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(req)
    html = response.read()
    response.close()
    return html

html = fetchPage("http://haukurhaf.net/json.txt")
jsonData = json.JSONDecoder().decode(html)

The JSON parser crashes with this error message: UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position 35: invalid continuation byte

Since I do not have any control over the server which holds the JSON data, I cannot control which encoding headers it sends out. I'm hoping I can solve this on my end somehow.

Any ideas?

Upvotes: 3

Views: 2732

Answers (3)

Gandaro

Reputation: 3443

The file is encoded using Latin-1, not UTF-8, so you have to specify the encoding:

jsonData = json.JSONDecoder('latin1').decode(html)

BTW: html is a bad name for a JSON document...

Upvotes: 2

Nick Bastin

Reputation: 31299

You need to decode the byte string into a unicode string first (it's Latin-1 encoded right now):

uhtml = html.decode("latin-1")
jdata = json.loads(uhtml)

Or, if you prefer to use the decoder class directly instead of loads:

json.JSONDecoder().decode(uhtml)
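
The decode-then-parse approach can be tried offline with a minimal sketch like the one below. It makes a few assumptions: the `raw` bytes are a made-up stand-in for the downloaded body (not the real contents of json.txt), `parse_latin1_json` is a hypothetical helper name, and the standard-library `json` module is used in place of simplejson, since both expose `loads`.

```python
import json

# Hypothetical stand-in for the bytes returned by response.read():
# a small JSON document encoded as Latin-1 (0xe1 is "á" in Latin-1).
raw = u'{"name": "P\xe1ll"}'.encode("latin-1")


def parse_latin1_json(raw_bytes):
    """Decode Latin-1 bytes to text first, then parse the text as JSON."""
    text = raw_bytes.decode("latin-1")
    return json.loads(text)


# Decoding the same bytes as UTF-8 fails, because 0xe1 starts a multi-byte
# UTF-8 sequence and the next byte ("l") is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

data = parse_latin1_json(raw)
```

Here `utf8_failed` ends up `True`, which mirrors the error in the question, while `data["name"]` holds the correctly decoded name.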

Upvotes: -1

bobince

Reputation: 536339

http://haukurhaf.net/json.txt

This resource is encoded as ISO-8859-1 or, more likely, its Windows variant, code page 1252. It is not UTF-8.

You can read it with response.read().decode('cp1252') to get a Unicode string which [simple]json should also be able to parse.
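
When the server does not declare an encoding, one possible approach is to try a short list of candidate encodings in order. The sketch below is only an illustration under that assumption: `decode_body` and its default candidate list are hypothetical, and trying UTF-8 first is a heuristic, not something the server guarantees. It also shows where cp1252 and Latin-1 differ, namely in the byte range 0x80 to 0x9f.

```python
def decode_body(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return text decoded with the first candidate encoding that succeeds.

    Latin-1 maps every byte to a character, so with it as the last
    candidate the loop always returns.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0xe1 alone is not valid UTF-8, so the cp1252 fallback decodes it as "á".
accented = decode_body(b"\xe1")

# 0x80 is the euro sign in cp1252 but a control character in Latin-1.
euro = decode_body(b"\x80")

# 0x81 is undefined in cp1252, so only the Latin-1 fallback accepts it.
undefined_in_cp1252 = decode_body(b"\x81")
```

For text that contains only bytes such as 0xe1, cp1252 and Latin-1 give the same result, which is why both answers above work on the sample file.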

However, in byte form, JSON must be encoded in one of the Unicode encodings (UTF-8, UTF-16 or UTF-32). This resource is therefore not valid JSON, and it will also fail if you attempt to load it from a browser.
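
If you need to store or pass the data on as valid JSON, a rough sketch of the re-encoding step could look like this. The `raw` bytes are again a made-up sample, cp1252 is assumed as the source encoding, and the standard-library `json` module stands in for simplejson.

```python
import json

# Hypothetical cp1252-encoded body, standing in for the downloaded bytes.
raw = b'{"name": "P\xe1ll"}'

# Decode with the assumed source encoding, then re-encode as UTF-8 so the
# bytes form valid JSON.
text = raw.decode("cp1252")
utf8_bytes = text.encode("utf-8")

# The UTF-8 bytes now parse without any special handling.
data = json.loads(utf8_bytes.decode("utf-8"))

# Alternatively, json.dumps escapes non-ASCII characters by default, which
# yields pure-ASCII JSON that is safe regardless of the transport encoding.
ascii_json = json.dumps(data)
```

The escaped form is longer, but it sidesteps the encoding question entirely for whoever reads the data next.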

Upvotes: 1
