akshitBhatia
akshitBhatia

Reputation: 1151

Fix unquoted keys in JSON-like file so that it uses correct JSON syntax

I have a very large JSON-like file, but it is not using proper JSON syntax: the object keys are not quoted. I'd like to write a script to fix the file, so that I can load it with json.loads.

I need to match all words followed by a colon and replace them with the quoted word. I think the regex is \w+\s*: and that I should use re.sub, but I'm not exactly sure how to do it.

How can I take the following input and get the given output?

# In
{abc : "xyz", cde : {}, fgh : ["hfz"]}
# Out
{"abc" : "xyz", "cde" : {}, "fgh" : ["hfz"]}

# In
{
    a: "b",
    b: {
        c: "d",
        d: []
    },
    e: "f"
}
# Out
{
    "a": "b",
    "b": {
        "c": "d",
        "d": []
    },
    "e": "f"
}

Upvotes: 1

Views: 2363

Answers (3)

mariotomo
mariotomo

Reputation: 9748

I met this old question while looking for ways to parse sloppy JSON shorthand into python.

my input looks like this:

'{lat: 8.5, lon: -80.0}'

and, as said, it has to be sloppy with spaces, it could just as well be:

'{lat:8.5,lon:-80.0}'

I like the YAML hint, but it doesn't go well with sloppy spacing, and I do not wish to add one more dependency to my already longish list, so I tried the regex solution, and it wasn't good enough for my case.

my solution looks like this:

re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', input_string)

it defines one group, holding alphanumeric data, it allows for whitespace to follow, it anchors to a semicolon, it replaces the matched substring with group one, enclosed in double quotes. it leaves alone all the rest. this pattern will not be matched if the key is already quoted.

in particular:

>>> re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', 
... '{abc : "xyz", cde : {a:"b", c: 0}, fgh : ["hfz"], 123: 123}')
'{"abc": "xyz", "cde": {"a":"b", "c": 0}, "fgh": ["hfz"], "123": 123}'
>>> re.sub(r'(\w+)[ ]*(?=:)', r'"\g<1>"', _)
'{"abc": "xyz", "cde": {"a":"b", "c": 0}, "fgh": ["hfz"], "123": 123}'
>>> 

Upvotes: 1

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627344

I suggest matching whole words that are not enclosed into double quotation marks and adding quotation marks around them:

import re
p = re.compile(r'(?<!")\b\w+\b(?!")')
test_str = "{abc : \"xyz\", cde : {}, fgh : [\"hfz\"]}"
print re.sub(p, r'"\g<0>"', test_str)

See IDEONE demo, output:

{"abc" : "xyz", "cde" : {}, "fgh" : ["hfz"]}

Upvotes: 2

schesis
schesis

Reputation: 59198

Rather than a potentially fragile regex solution, you can take advantage of the fact that while your log file isn't valid JSON, it is valid YAML. Using the PyYAML library, you can load it into a Python data structure and then write it back out as valid JSON:

import json
import yaml

with open("original.log") as f:
    data = yaml.load(f)

with open("jsonified.log", "w") as f:
    json.dump(data, f)

Upvotes: 11

Related Questions