Reputation: 4884
I have a service that receives data from an outside service (through a Redis list used as a queue). The data is a flat JSON-encoded dictionary; an example looks like this:
{
"type": "visit",
"referer": "http://www.google.com/",
"session_referer": "http://www.google.com/\x0e",
"uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
"user_ip": "1.2.3.4",
"user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
"user_locale": "en_US",
}
The problem is that, as you can see in the above example, the referer or session_referer value sometimes contains invalid data (bytes that can't be decoded using any of the encodings I expect, such as UTF-8 or ISO-8859-1).
My issue is that I can't access any of the other data. I can live with the fact that the referrer is messed up, but I still need the other data. Is there any way to do a "raw" decode without turning the data into any specific encoding and then letting me handle it from there?
Upvotes: 1
Views: 3206
Reputation: 123549
Given a text file containing your JSON-like "string" with a 0E byte in the "session_referer" value, the following Python code removes the troublesome values ...
# -*- coding: iso-8859-1 -*-
import json
import re
# retrieve the JSON data into a string
with open(r'C:\Users\Gord\Desktop\jsonData.txt', 'r') as f:
    s = f.read()
print '~> raw JSON string'
print s
print
# remove "characters" below \x20 except \n
s = re.sub(r'[\000-\011\013-\037]', '', s)
# remove (extraneous) last comma
s = re.sub(',\n}$', '\n}', s)
print '~> tweaked JSON string'
print s
print
# decode tweaked JSON string
j = json.loads(s)
# see what we got
print '~> decoded result "pretty printed"'
print json.dumps(j, sort_keys=True, indent=4, separators=(',', ': '))
print
# extract just one element
print '~> print just j["user_ip"]'
print j["user_ip"]
... and produces the following results in the Python IDLE shell:
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
~> raw JSON string
{
"type": "visit",
"referer": "http://www.google.com/",
"session_referer": "http://www.google.com/♫",
"uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
"user_ip": "1.2.3.4",
"user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
"user_locale": "en_US",
}
~> tweaked JSON string
{
"type": "visit",
"referer": "http://www.google.com/",
"session_referer": "http://www.google.com/",
"uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97",
"user_ip": "1.2.3.4",
"user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
"user_locale": "en_US"
}
~> decoded result "pretty printed"
{
"referer": "http://www.google.com/",
"session_referer": "http://www.google.com/",
"type": "visit",
"user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
"user_ip": "1.2.3.4",
"user_locale": "en_US",
"uuid": "48e8ea41-420d-021c-be16-7ac5b7c6fb97"
}
~> print just j["user_ip"]
1.2.3.4
>>>
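For payloads that arrive as bytes (for example, from a Redis client) rather than from a file, the same cleanup could be wrapped in a small function. This is only a sketch: the name clean_and_load and the sample payload are hypothetical, and it assumes that decoding as ISO-8859-1 is acceptable because that codec maps every byte to a code point and therefore never raises.

```python
import json
import re

# Control characters below \x20, except \n (same character class as in the script above).
CONTROL_CHARS = re.compile(r'[\000-\011\013-\037]')
# A comma directly before the closing brace, which strict JSON does not allow.
TRAILING_COMMA = re.compile(r',\s*}\s*$')


def clean_and_load(raw):
    """Hypothetical helper: sanitize a JSON-like payload, then decode it."""
    if isinstance(raw, bytes):
        # Assumption: ISO-8859-1 maps every byte to a code point, so this cannot fail.
        raw = raw.decode('iso-8859-1')
    raw = CONTROL_CHARS.sub('', raw)      # drop the troublesome control characters
    raw = TRAILING_COMMA.sub('\n}', raw)  # drop the extraneous last comma
    return json.loads(raw)


# Made-up payload with a 0E byte and a trailing comma, mirroring the question's example.
payload = b'{\n"referer": "http://www.google.com/\x0e",\n"user_ip": "1.2.3.4",\n}'
record = clean_and_load(payload)
print(record["user_ip"])
```

Note that the trailing-comma pattern only handles a comma before the final closing brace, which is enough for a flat dictionary like the one in the question.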
Upvotes: 2
Reputation: 1176
You can try passing strict=False to json.loads, which allows control characters inside strings.
https://docs.python.org/2/library/json.html
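A minimal sketch of that suggestion, using a shortened, made-up payload. Unlike the question's example, it has no trailing comma: strict=False only relaxes the control-character check, so a trailing comma would still have to be removed separately.

```python
import json

# The \x0e escape puts a real control character inside the JSON string value.
raw = '{"session_referer": "http://www.google.com/\x0e", "user_ip": "1.2.3.4"}'

try:
    json.loads(raw)  # the default, strict=True, rejects raw control characters
    strict_ok = True
except ValueError:
    strict_ok = False

# With strict=False the control character is kept as part of the decoded value.
record = json.loads(raw, strict=False)
print(strict_ok)
print(record["user_ip"])
print(repr(record["session_referer"]))
```

The other fields are then accessible as usual, and the damaged value can be cleaned up or discarded afterwards.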
Upvotes: 1