RamsesXVII

Reputation: 305

Parsing a file with python to get a CSV

I have a file whose lines look like this:

{u'af': 4, **[a lot of attributes I don't need]**, u'prb_id': **6092**, u'result': [{u'result': [{u'rtt': 0.266, u'ttl': 255, u'from': u'**208.80.155.67**', u'size': 28}, {u'rtt': 0.413, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 1.565, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}], u'hop': 1}, {u'result': [{u'rtt': 68.468, u'ttl': 254, u'from': u'**206.126.237.239**', u'size': 68}, {u'rtt': 67.844, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 70.378, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}], u'hop': 2}[**a lot of attributes I don't need**]}

I tried to parse it as a JSON file with:

import json

data = []
with open('prova1') as f:
    for line in f:
        data.append(json.loads(line))

But I get the following ValueError:

ValueError: Expecting property name: line 1 column 2 (char 1)

What I need is to take the prb_id value and every value of the from field, without duplicates.

My goal is to get a CSV file with the following format:

6092,208.80.155.67,206.126.237.239

How can I parse it using Python?

Upvotes: 2

Views: 126

Answers (3)

Serge Ballesta

Reputation: 148890

This is not JSON (*), so the json module cannot decode it. It does look like Python literal syntax, though, so ast.literal_eval can handle it, although you will lose the order of the fields:

import ast

data = []
with open('prova1') as f:
    for line in f:
        data.append(ast.literal_eval(line))

If you later want to extract all the from fields, and since your structure can contain nested dictionaries and lists, you could extract them recursively with:

def parse_for_key(m, id_key, k):
    """m is the dictionary to parse, id_key the key for the id, k the key to extract"""
    def _do_parse(m, k, l):  # recursive function passing the list being computed
        if isinstance(m, list):  # process a list
            for elt in m:  # recurse into all elements of the list
                _do_parse(elt, k, l)
        elif isinstance(m, dict):  # process a dictionary
            if (k in m) and not (m[k] in l):  # add the value for the key if not already there
                l.append(m[k])
            for elt in m.values():
                _do_parse(elt, k, l)  # and recurse into the values
        return l  # return the list
    return _do_parse(m, k, [m[id_key]])

You can then call parse_for_key(m, 'prb_id', 'from'), where m is the result of ast.literal_eval on one line, and you will get something like:

[6026, '83.212.7.42', '83.212.7.41', '62.217.100.63', '83.97.88.69', '62.40.112.165', ...]

(*) JSON requires strings, including object keys, to be enclosed in double quotes ("), and has no notion of the u prefix for unicode strings.
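
To get from such a list to the CSV line asked for in the question, the standard csv module can write one row per parsed line. This is only a minimal sketch: the helper name write_rows is hypothetical, and the sample row reuses the values from the question rather than real parser output.

```python
import csv
import io

def write_rows(rows, out):
    """Write each row (prb_id followed by its unique 'from' addresses) as one CSV line."""
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row)

# Illustrative row, shaped like the list returned by parse_for_key
rows = [[6092, '208.80.155.67', '206.126.237.239']]

buf = io.StringIO()  # stands in for a real file opened for writing
write_rows(rows, buf)
csv_text = buf.getvalue()
print(csv_text)
```

With a real file, passing a handle opened as open('out.csv', 'w', newline='') in place of buf should behave the same way under Python 3.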

Upvotes: 1

damienfrancois

Reputation: 59090

This is not valid JSON; it is the format Python uses to print dictionaries. For instance, JSON requires double quotes, not single quotes, and JSON does not allow u"string" to define a string.

Option 1: transform the string to JSON (to convince oneself that this is the root of the error):

$ cat ttt.json
{u'af': 4, u'prb_id': 6092, u'result': [{u'result': [{u'rtt': 0.266, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 0.413, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}, {u'rtt': 1.565, u'ttl': 255, u'from': u'208.80.155.67', u'size': 28}], u'hop': 1}, {u'result': [{u'rtt': 68.468, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 67.844, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}, {u'rtt': 70.378, u'ttl': 254, u'from': u'206.126.237.239', u'size': 68}], u'hop': 2}]}
>>> import json
>>> a = open("ttt.json").read()
>>> json.loads(a)
Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/[...]__init__.py", line 309, in loads
        return _default_decoder.decode(s)
      File "/usr/local/[...]decoder.py", line 351, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/local/[...]decoder.py", line 367, in raw_decode
        obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 1 column 1 (char 1)
>>> # But after some replacements,
>>> import re
>>> json.loads(re.sub(r"u'([a-zA-Z0-9_.]*)'", r'"\1"', a))
    {'prb_id': 6092, 'result': [{'result': [{'rtt': 0.266, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 0.413, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 1.565, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}], 'hop': 1}, {'result': [{'rtt': 68.468, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 67.844, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 70.378, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}], 'hop': 2}], 'af': 4}

Option 2 (the preferred option): use ast to read the string:

>>> import ast
>>> ast.literal_eval(a)
{'prb_id': 6092, 'result': [{'result': [{'rtt': 0.266, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 0.413, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}, {'rtt': 1.565, 'size': 28, 'from': '208.80.155.67', 'ttl': 255}], 'hop': 1}, {'result': [{'rtt': 68.468, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 67.844, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}, {'rtt': 70.378, 'size': 68, 'from': '206.126.237.239', 'ttl': 254}], 'hop': 2}], 'af': 4}
>>>
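
From the parsed dictionary, the prb_id and the distinct from values can then be collected in order. The following is a sketch rather than a definitive implementation: the function name prb_and_sources is hypothetical, the sample line is a shortened version of the one in the question, and it assumes every hop entry is a dictionary holding a nested result list, as in that sample.

```python
import ast

def prb_and_sources(line):
    """Return [prb_id, from1, from2, ...] with duplicate 'from' values dropped, order kept."""
    record = ast.literal_eval(line)
    row = [record['prb_id']]
    for hop in record.get('result', []):
        for reply in hop.get('result', []):
            addr = reply.get('from')  # some replies may have no 'from' key
            if addr is not None and addr not in row[1:]:
                row.append(addr)
    return row

# Shortened sample line (assumption: same nesting as the real data)
sample = ("{u'af': 4, u'prb_id': 6092, u'result': ["
          "{u'result': [{u'rtt': 0.266, u'from': u'208.80.155.67'}, "
          "{u'rtt': 0.413, u'from': u'208.80.155.67'}], u'hop': 1}, "
          "{u'result': [{u'rtt': 68.468, u'from': u'206.126.237.239'}], u'hop': 2}]}")

row = prb_and_sources(sample)
print(','.join(str(v) for v in row))
```

Joining with commas is enough here because neither the id nor the addresses contain commas or quotes; for less predictable fields the csv module would be the safer choice.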

Upvotes: 0

C.LECLERC

Reputation: 510

The problem is that the JSON parser expects object keys to be strings, and strings in the JSON spec have no unicode prefix (see http://www.json.org/json-en.html).

I don't know of any way to make the json module parse the unicode prefix correctly.

Are those ** in your real data? If not, you can still use this dirty trick (I'm not sure it will work in all cases):

import json

s = """{u'a': 1, u'l':[u'b', u'c']}"""
exec("d = {}".format(s))

print(d)
print(json.dumps(d))

Output:

{u'a': 1, u'l': [u'b', u'c']}

{"a": 1, "l": ["b", "c"]}

The best way is of course to get well-formed JSON as input, but I suppose you cannot have that.
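
If the input cannot be changed, the same round trip can also be sketched without exec, using ast.literal_eval from the standard library, which only evaluates literals and does not run arbitrary code. The sample string is the one used above; sort_keys is added here only to make the printed key order predictable.

```python
import ast
import json

s = """{u'a': 1, u'l':[u'b', u'c']}"""

# literal_eval accepts Python literals only (dicts, lists, strings, numbers, ...)
d = ast.literal_eval(s)

as_json = json.dumps(d, sort_keys=True)
print(as_json)
```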

Upvotes: 0
