Reputation: 472
I'm trying to parse a JSON file that includes some Finnish characters. A good example is a region called Etelä-Karjala. It all worked locally when I opened the JSON as a file and loaded it with json.load. The unicode I got for this region was u'Etel\xe4-Karjala'.
However, my next step was to do the same thing on the server, where the JSON is stored at a URL from which I have to retrieve it. I used json.loads(requests.get(url).text), and the unicode I got for the same region was now u'Etel\xc3\xa4-Karjala'.
Why do I get these different results even though the input file is the same? Can you suggest a workaround or a good pattern to load json from a url that will not cause this issue?
Here is an example to reproduce the issue:
import requests
import json
# Example with loading from request
r = requests.get('http://becs.aalto.fi/~smirnod1/maakunnat.geojson')
geo1 = json.loads(r.text)
test1 = geo1['features'][5]['properties']['text']
# test1 = u'Etel\xc3\xa4-Karjala'
Then, I download this json and try to open it as a file (this was the approach I used while I developed my application).
# Example with loading from file
with open('/Users/dmitrysmirnov/Downloads/maakunnat.geojson') as f:
    geo2 = json.load(f)
test2 = geo2['features'][5]['properties']['text']
# test2 = u'Etel\xe4-Karjala'
I assume that u'Etel\xe4-Karjala' (or result of test2) should be what I aim for. Or at least that is the result that will not break the application.
Upvotes: 1
Views: 74
Reputation: 536519
json.loads(r.text)
r.text is the content of the response decoded from bytes to unicode by the requests module. Because the server sent no charset information, requests has guessed that the file content might be in the ISO-8859-1 encoding. requests was wrong in this case, because the content is actually JSON, which has its own mechanism for determining the encoding, namely that it is UTF-8 by default.
If you feed the raw bytes of the response directly to the JSON parser, it will use its knowledge of the encoding rules for JSON to load the data correctly:
json.loads(r.content)
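To see the difference without touching the network, here is a minimal sketch. It assumes Python 3.6 or later (where json.loads accepts bytes), and the variable raw is a hypothetical stand-in for r.content, not the actual file from the question.

```python
import json

# Hypothetical stand-in for r.content: JSON text encoded as UTF-8 bytes.
raw = '{"text": "Etel\u00e4-Karjala"}'.encode('utf-8')

# Simulates the wrong ISO-8859-1 guess: each of the two UTF-8 bytes
# for "ä" is turned into a separate character.
wrong = json.loads(raw.decode('iso-8859-1'))['text']

# Passing the bytes directly lets the JSON parser work out the encoding.
right = json.loads(raw)['text']

print(repr(wrong))
print(repr(right))
```

If you would rather stay in the requests layer, setting r.encoding = 'utf-8' before reading r.text, or calling r.json(), should give the same result as far as I know, but I have not checked either against this particular server.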
In general, text/... media types with a charset= parameter are best decoded in the requests layer (using .text), which can automatically take heed of that parameter. But application/... types like JSON have their own in-band rules for signalling encoding, so bytes should be passed to the parsers for these types instead of being decoded in requests.
Upvotes: 2
Reputation: 798884
The server is misconfigured. Either tell it to report that the file is encoded as UTF-8 (via a charset parameter in the Content-Type header), or encode the JSON using only ASCII characters.
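For the ASCII-only option, a rough sketch of what that could look like if the file is generated with Python's json module (an assumption; the question does not say how the file was produced). The data dict is made up for illustration.

```python
import json

# Made-up sample data containing a non-ASCII character.
data = {"text": "Etel\u00e4-Karjala"}

# ensure_ascii=True is the default; it writes "ä" as the escape \u00e4.
ascii_json = json.dumps(data, ensure_ascii=True)

# The output is pure ASCII, so any ASCII-compatible decoding guess
# on the client side yields the same parsed result.
raw = ascii_json.encode('ascii')
as_latin1 = json.loads(raw.decode('iso-8859-1'))
as_utf8 = json.loads(raw.decode('utf-8'))

print(ascii_json)
print(as_latin1 == as_utf8 == data)
```

The trade-off is a slightly larger file that is harder to read, in exchange for not depending on the client guessing the charset correctly.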
Upvotes: 2