Reputation: 305
I have a file of lines made as follows:
{u'af': 4, **[a lots of attribute i don't need]**, u'prb_id': **6092**, u'result': [{u'result': [{u'rtt': 0.266, u'ttl': 255, u'from': u'**208.80.155.67**', u'size': 28}, {u'rtt': 0.413, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 1.565, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}], u'hop': 1}, {u'result': [{u'rtt': 68.468, u'ttl': 254, u'from': u'**206.126.237.239**', u'size': 68}, {u'rtt': 67.844, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 70.378, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}], u'hop': 2}[**a lots of attribute i don't need**]}
I tried to parse it as a JSON file with:
import json
data = []
with open('prova1') as f:
    for line in f:
        data.append(json.loads(line))
But I get the following ValueError:
ValueError: Expecting property name: line 1 column 2 (char 1)
What I need is to take the value of prb_id and every value in the from fields, avoiding duplicates.
My goal is to get a CSV file with the following format:
6092,208.80.155.67,206.126.237.239
How can I parse it using Python?
Upvotes: 2
Views: 126
Reputation: 148890
This is not JSON (*), so the json module cannot decode it. It does look like Python syntax, however, so ast.literal_eval can do a good job with it, though the order of the fields may be lost (dicts are unordered before Python 3.7):
import ast
data = []
with open('prova1') as f:
    for line in f:
        data.append(ast.literal_eval(line))
If you later want to extract all the from fields, and since your structure can contain nested dictionaries and lists, you can extract them recursively with:
def parse_for_key(m, id, k):
    """m is the dictionary to parse, id the key for the id, k the key to extract."""
    def _do_parse(m, k, l):  # recursive function passing the list being computed
        if isinstance(m, list):  # process a list
            for elt in m:  # recurse into every element of the list
                _do_parse(elt, k, l)
        elif isinstance(m, dict):  # process a dictionary
            if (k in m) and not (m[k] in l):  # add the value for the key if not already there
                l.append(m[k])
            for elt in m.values():
                _do_parse(elt, k, l)  # and recurse into the values
        return l  # return the list
    return _do_parse(m, k, [m[id]])
You can then use parse_for_key(m, 'prb_id', 'from'), where m is the result of ast.literal_eval on one line, and you will get something like:
[6026, '83.212.7.42', '83.212.7.41', '62.217.100.63', '83.97.88.69', '62.40.112.165', ...]
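To go from such a list to the CSV row asked for in the question, something along these lines might work. It is only a sketch: it assumes every line of the file is a Python dict literal with a prb_id key, the helper names collect_values and line_to_row are made up for illustration, and the sample line is a shortened version of the one in the question.

```python
import ast
import csv
import io


def collect_values(node, key, found):
    """Append to `found` every value stored under `key`, skipping duplicates."""
    if isinstance(node, list):
        for item in node:
            collect_values(item, key, found)
    elif isinstance(node, dict):
        if key in node and node[key] not in found:
            found.append(node[key])
        for value in node.values():
            collect_values(value, key, found)
    return found


def line_to_row(line, id_key="prb_id", key="from"):
    """Turn one dict-literal line into [id, value1, value2, ...]."""
    record = ast.literal_eval(line.strip())
    return [record[id_key]] + collect_values(record, key, [])


# Shortened sample line (assumption: real lines have the same shape).
sample = ("{u'af': 4, u'prb_id': 6092, u'result': ["
          "{u'result': [{u'rtt': 0.266, u'from': u'208.80.155.67'},"
          " {u'rtt': 0.413, u'from': u'208.80.155.67'}], u'hop': 1},"
          " {u'result': [{u'rtt': 68.468, u'from': u'206.126.237.239'}], u'hop': 2}]}")

out = io.StringIO()  # stand-in for open('out.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(line_to_row(sample))
csv_text = out.getvalue()
print(csv_text)
```

For a real file, the same line_to_row call would go inside the `for line in f:` loop, with the writer attached to an actual output file instead of the in-memory buffer.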
(*) JSON requires strings, including object keys, to be enclosed in double quotes ("), and has no notion of the u prefix for unicode strings.
Upvotes: 1
Reputation: 59090
This is not valid JSON; it is the format Python uses to print dictionaries. For instance, JSON requires double quotes, not single quotes, and JSON does not allow u"string" to define a string.
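A quick way to see both points, as a sketch that uses small made-up strings rather than the real data (is_json is a hypothetical helper): the json module accepts the double-quoted form and rejects the single-quoted and u-prefixed ones.

```python
import json

valid = '{"prb_id": 6092}'          # double-quoted key: valid JSON
single_quoted = "{'prb_id': 6092}"  # single quotes: Python repr style
u_prefixed = '{u"prb_id": 6092}'    # u prefix: Python 2 unicode literal


def is_json(text):
    """Return True if json.loads accepts `text`, False otherwise."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


results = [is_json(valid), is_json(single_quoted), is_json(u_prefixed)]
print(results)
```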
Option 1: transform the string to json (to convince oneself that this is the root of the error):
$ cat ttt.json
{u'af': 4, u'prb_id': 6092, u'result': [{u'result': [{u'rtt': 0.266, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 0.413, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 1.565, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}], u'hop': 1}, {u'result': [{u'rtt': 68.468, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 67.844, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 70.378, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}], u'hop': 2}]}
>>> import json
>>> a = open("ttt.json").read()
>>> json.loads(a)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/[...]__init__.py", line 309, in loads
return _default_decoder.decode(s)
File "/usr/local/[...]decoder.py", line 351, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/[...]decoder.py", line 367, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)
>>> # But after some replacements,
>>> import re
>>> json.loads(re.sub(r"u'([a-zA-Z0-9_.]*)'", r'"\1"', a))
{'prb_id': 6092, 'result': [{'result': [{'rtt': 0.266, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 0.413, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 1.565, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}], 'hop': 1}, {'result': [{'rtt': 68.468, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 67.844, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 70.378, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}], 'hop': 2}], 'af': 4}
Option 2 (the preferred option): use ast to read the string:
>>> import ast
>>> ast.literal_eval(a)
{'prb_id': 6092, 'result': [{'result': [{'rtt': 0.266, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 0.413, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 1.565, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}], 'hop': 1}, {'result': [{'rtt': 68.468, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 67.844, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 70.378, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}], 'hop': 2}], 'af': 4}
>>>
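One reason ast.literal_eval is a reasonable choice here is that it only accepts Python literals (strings, numbers, tuples, lists, dicts, sets, booleans, None) and refuses anything else, unlike eval or exec. A small sketch with made-up input strings:

```python
import ast

# A dict literal with u-prefixed strings parses into a normal dict.
record = ast.literal_eval("{u'prb_id': 6092, u'from': u'208.80.155.67'}")
print(record["prb_id"], record["from"])

# An expression that is not a plain literal is rejected instead of being run.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```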
Upvotes: 0
Reputation: 510
The problem is that the JSON parser expects object keys to be strings, and the JSON spec (see http://www.json.org/json-en.html) has no unicode prefix for strings.
I don't know of any way to make json parse the unicode prefix correctly.
Are those ** in your real data? If not, you can still use this dirty trick (I'm not sure it will work in all cases, and exec runs whatever code the string contains, so use it only on input you trust):
import json
s = """{u'a': 1, u'l':[u'b', u'c']}"""
exec("d = {}".format(s))
print(d)
print(json.dumps(d))
Output:
{u'a': 1, u'l': [u'b', u'c']}
{"a": 1, "l": ["b", "c"]}
The best way is of course to get well-formatted JSON as input, but I suppose you cannot have that.
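If the program that writes the file can be changed, one possibility is to write each record with json.dumps instead of printing the dict, so that json.loads works on every line. This is only a sketch: the records list and the in-memory buffer are placeholders for illustration.

```python
import io
import json

# Placeholder records; the real ones would come from the measurement tool.
records = [{"prb_id": 6092, "from": "208.80.155.67"},
           {"prb_id": 6026, "from": "83.212.7.42"}]

buffer = io.StringIO()  # stand-in for a real file opened for writing
for rec in records:
    buffer.write(json.dumps(rec) + "\n")  # one JSON document per line

buffer.seek(0)
loaded = [json.loads(line) for line in buffer]
print(loaded == records)
```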
Upvotes: 0