Reputation: 13178
I'm trying to allow users to sign up for my service, and I'm noticing errors whenever somebody signs up with Latin American characters in their name. I tried reading several SO posts/websites, listed below:
Python regex against Latin-1 character encoding?
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
http://docs.python.org/2/library/json.html
https://pypi.python.org/pypi/anyjson
but was still unable to solve it. My code example is as per below:
>>> val = json.dumps({"name":"Déjà"}, encoding="ISO-8859-1")
>>> val
'{"name": "D\\u00c3\\u00a9j\\u00c3\\u00a0"}'
Is there any way to force the encoding to work in this case, for both serializing and deserializing? Any help is appreciated!
EDIT
The client is Android and iPhone applications. I'm using the following libraries to encode the json on the clients:
http://loopj.com/android-async-http/ (Android)
https://github.com/AFNetworking/AFNetworking (iOS)
EDIT 2
The same text was received by the server from the Android client, as shown below:
{"NAME":"D\ufffdj\ufffd"}
I was using anyjson to deserialize that, and it said:
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 135, in loads
return implementation.loads(value)
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 99, in loads
return self._decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
ValueError: ('utf8', "D\xe9j\xe0", 1, 2, 'invalid continuation byte')
Upvotes: 0
Views: 1027
Reputation: 13178
Turns out this was mainly an issue in how I was doing the encoding from the Android side.
I am now setting the StringEntity this way in Android, and it's working now:
StringEntity se = new StringEntity(obj.toString(), "UTF-8");
se.setContentType("application/json;charset=UTF-8");
se.setContentEncoding( new BasicHeader(HTTP.CONTENT_TYPE, "application/json"));
Also, I was using anyjson on the server, which was using simplejson. This was creating errors at times as well. I switched to using the json library for Python.
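For anyone hitting the same traceback, here is a minimal Python 3 sketch of why the server rejected the body. It assumes the client sent ISO-8859-1 bytes, which is what the \xe9 and \xe0 in the ValueError suggest. parse_body is a hypothetical helper written for this illustration, not something from anyjson or simplejson.

```python
import json


def parse_body(raw: bytes) -> dict:
    """Decode a request body as UTF-8, then parse it as JSON.

    Hypothetical helper for illustration only.
    """
    return json.loads(raw.decode("utf-8"))


# A client that encodes the body as ISO-8859-1 sends "é" as the single
# byte 0xE9 and "à" as 0xE0.
latin1_body = '{"NAME":"Déjà"}'.encode("iso-8859-1")

# A client that encodes the body as UTF-8 sends "é" as 0xC3 0xA9 and
# "à" as 0xC3 0xA0.
utf8_body = '{"NAME":"Déjà"}'.encode("utf-8")

try:
    parse_body(latin1_body)
    latin1_error = None
except UnicodeDecodeError as exc:  # UnicodeDecodeError subclasses ValueError
    latin1_error = exc

# The UTF-8 body parses cleanly.
parsed = parse_body(utf8_body)
```

The Latin-1 body fails at the decoding step, before any JSON parsing happens, which is consistent with the fix being on the client side.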
Upvotes: 0
Reputation: 44256
JSON should almost always be in a Unicode encoding (when encoded), and if you're writing a web server, that encoding should be UTF-8. The following, in Python 3, is basically correct:
In [1]: import json
In [2]: val = json.dumps({"name":"Déjà"})
In [3]: val
Out[3]: '{"name": "D\\u00e9j\\u00e0"}'
A closer look:
'{"name": "D\\u00e9j\\u00e0"}'
            ^^^^^^^
            The text \u00e9, which in JSON means "é".
            The slash is doubled because we're looking at a repr of a str.
You can then send val to the client, and in JavaScript, JSON.parse should give you the right result.
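As a rough sketch of the round trip in Python 3, using only the standard json module: the escaped form and the literal form are both valid JSON, and both decode back to the same data as long as the bytes on the wire are UTF-8. The variable names here are just for illustration.

```python
import json

record = {"name": "Déjà"}

# Default: non-ASCII characters are written as \uXXXX escapes, so the
# JSON text itself contains only ASCII characters.
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as-is; the text must then be
# encoded as UTF-8 before it goes on the wire.
literal = json.dumps(record, ensure_ascii=False)

escaped_bytes = escaped.encode("utf-8")
literal_bytes = literal.encode("utf-8")

# Either form decodes back to the same data on the receiving side.
from_escaped = json.loads(escaped_bytes.decode("utf-8"))
from_literal = json.loads(literal_bytes.decode("utf-8"))
```

The default escaped form is the more forgiving of the two, since a pure-ASCII body survives even if one side gets the charset wrong.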
Because you mentioned, "when somebody signs up": that implies data coming from the client (web browser) to you. How is that data being sent? What library/libraries are you writing a webserver in?
Upvotes: 3