Reputation: 15602
I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:
import simplejson

json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    # note: raising a bare UnicodeDecodeError fails (it needs arguments),
    # so raise a plain exception instead
    raise ValueError("could not serialize post_dict with any known encoding")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is that I need to send it in a POST request to our Node.js server. So if someone has a different solution that lets me do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.
Upvotes: 4
Views: 2220
Reputation: 414255
You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see A good way to get the charset/encoding of an HTTP response in Python.
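As a minimal sketch of that idea: the charset usually arrives as a parameter of the Content-Type response header, and the standard library can parse it. `parse_charset` below is a hypothetical helper name, and in real code the header string would come from the response object rather than a literal.

```python
from email.message import Message

def parse_charset(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value,
    falling back to a default when the server omits it."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None
    return msg.get_content_charset() or default

print(parse_charset("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(parse_charset("text/html"))                      # utf-8 (fallback)
```

When the header carries no charset, HTML may still declare one in a `<meta>` tag, so a fallback default is only a last resort.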
To send post_dict as JSON, make sure all strings in it are Unicode (just convert the HTML to Unicode as soon as you receive it) and don't pass the encoding parameter to the json.dumps() call. That parameter won't help you anyway if different websites (where you get your HTML strings) use different encodings.
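A minimal sketch of decoding early and serializing Unicode, assuming the bytes and the charset have already been obtained from the response (the Latin-1 payload here is a stand-in):

```python
import json

# Bytes as they might arrive off the wire, with the charset taken
# from the response headers (ISO-8859-1 in this hypothetical case).
raw = b"caf\xe9"
charset = "iso-8859-1"

# Convert to Unicode immediately, at the network boundary...
html = raw.decode(charset)

# ...so that serialization never needs an encoding parameter:
# every string in the dict is already Unicode.
payload = json.dumps({"html": html})
print(payload)
```

Because each page is decoded with its own charset before it touches the dict, pages from differently encoded sites can safely be mixed in one JSON document.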
Upvotes: 1