Reputation: 15602
I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:
import simplejson

json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    # note: raising a bare UnicodeDecodeError fails (it needs arguments),
    # so raise a plain exception instead
    raise ValueError("could not serialize post_dict with any known encoding")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is that I need to send it in a POST request to our Node.js server. So if someone has a different solution that lets me do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.
Upvotes: 4
Views: 2220
Reputation: 414255
You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see A good way to get the charset/encoding of an HTTP response in Python.
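As a minimal sketch of that idea: the charset usually arrives as a parameter of the Content-Type response header, and the standard library can parse it. `parse_charset` below is a hypothetical helper name, and in real code the header string would come from the response object rather than a literal.

```python
from email.message import Message

def parse_charset(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value,
    falling back to a default when the server omits it."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None
    return msg.get_content_charset() or default

print(parse_charset("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(parse_charset("text/html"))                      # utf-8 (fallback)
```

When the header carries no charset, HTML may still declare one in a `<meta>` tag, so a fallback default is only a last resort.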
To send post_dict as JSON, make sure all strings in it are Unicode (just convert the HTML to Unicode as soon as you receive it) and don't pass the encoding parameter to the json.dumps() call. That parameter won't help you anyway if different websites (where you get your HTML strings) use different encodings.
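A minimal sketch of decoding early and serializing Unicode, assuming the bytes and the charset have already been obtained from the response (the Latin-1 payload here is a stand-in):

```python
import json

# Bytes as they might arrive off the wire, with the charset taken
# from the response headers (ISO-8859-1 in this hypothetical case).
raw = b"caf\xe9"
charset = "iso-8859-1"

# Convert to Unicode immediately, at the network boundary...
html = raw.decode(charset)

# ...so that serialization never needs an encoding parameter:
# every string in the dict is already Unicode.
payload = json.dumps({"html": html})
print(payload)
```

Because each page is decoded with its own charset before it touches the dict, pages from differently encoded sites can safely be mixed in one JSON document.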
Upvotes: 1