Lewis Green

Reputation: 77

How to remove certain comments from a json file? (/*)

I have about 500 JSON files with comments in them. Trying to update a field in one of these files with a new value throws an error. I managed to use commentjson to handle comments like this: // some text. With those, the file updates and no errors are thrown.

But there are about 100 JSON files with comments like this:

  /*

   1. sometext.
        i. sometext
        ii. sometext 
   2. sometext

  */

commentjson just crashes when /* exists. If I remove the /* comment by hand and run the code, it works: the file is updated and any // comments are removed. How can I write code that handles /* and all the text between /* and */?

This is my current code that can remove //

with open(f"{i['Location']}\\{file_name}",'r') as f:
    json_info = commentjson.load(f) #Gets info from the json file
    json_info['password'] = password

    with open(f"{i['location_Daily']}\\{file_name}",'w') as f:
        commentjson.dump(json_info,f,indent = 4) #updates the password   
        print("updated")

Upvotes: 2

Views: 1223

Answers (2)

Martijn Pieters

Reputation: 1123520

You have a few options:

  • Read the whole file into a string, then use a regular expression to pre-process the text. E.g.:

    import re
    import commentjson

    with open(...) as f:
        json_text = f.read()
    # remove everything from '/*' to '*/', where each character in between is either
    # - a '*' character that is *not* followed by '/'
    # - any character that is not '*'
    without_comments = re.sub(r"/\*(?:\*(?!/)|[^*])*\*/", "", json_text)
    json_info = commentjson.loads(without_comments)
    

    Note that this approach will corrupt the data if any JSON string value contains /* and */, because the regex cannot tell a string from a comment. A regex is not a JSON parser.

  • Try to update the parser that the commentjson project uses to parse JSON. Looking at the project source code, they use the Lark parsing library, so you could monkey patch the module with additional grammar.

    I note that the main branch already has a grammar rule defining multi-line comments:

    COMMENT: "/*" /(.|\\n)+?/ "*/"
           | /(#|\\/\\/)[^\\n]*/
    

    but that is not yet part of their release. You can, however, re-use that rule:

    # the parser object lives in the commentjson.commentjson submodule
    from commentjson import commentjson as implementation
    
    serialized = implementation.parser.serialize()
    for tok in serialized["parser"]["lexer_conf"]["tokens"]:
        if tok["name"] != "COMMENT":
            continue
        if tok["pattern"]["value"].startswith("(#|"):
            # only supports `#` or `//` comments, add block comments
            tok["pattern"]["value"] = r'(?:/\*(?:\*(?!/)|[^*])*\*/|(#|\/\/)[^\n]*)'
        break
    
    implementation.parser = implementation.parser.deserialize(serialized, None, None)
    

    I used my own regex in that grammar update rather than the version used by the project.

  • Find a different library to parse the input. There are several options that claim to support parsing JSON with the same syntax:

    I have not tried any of these and can't comment on their usability or performance.
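
For the first option, the string caveat can be avoided by matching string literals in the same pattern and putting them back untouched. The following is a minimal sketch under that assumption and is not part of commentjson. `strip_comments` and `_TOKEN` are hypothetical names, and it only handles // and /* */ comments (no # comments, no trailing commas).

```python
import json
import re

# One pattern for both string literals and comments. Strings are tried first,
# so a '/*' or '//' that sits inside a string is consumed as part of the string.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'   # double-quoted JSON string, including escapes
    r'|/\*.*?\*/'          # block comment, non-greedy so it stops at the first '*/'
    r'|//[^\n]*',          # line comment up to the end of the line
    re.DOTALL,
)


def strip_comments(text):
    """Return text with // and /* */ comments removed, strings left intact."""
    def keep_strings(match):
        token = match.group(0)
        # Put string literals back unchanged, replace comments with nothing.
        return token if token.startswith('"') else ""
    return _TOKEN.sub(keep_strings, text)


sample = '''{
    /* block
       comment */
    "url": "http://example.com/*not-a-comment*/x",  // trailing comment
    "password": "old"
}'''

data = json.loads(strip_comments(sample))
print(data)
```

Because the comments are gone after this step, the result can be passed to the standard json module as well as to commentjson.loads.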

Upvotes: 2

Niel Godfrey P. Ponciano

Reputation: 10709

You can use another library, such as json5 or pyjson5, or anything else that supports JSON5:

import json5
import pyjson5

data = '''
{
    "something": [
        ["any"],
        ["thing", "here", 10]    // This is comment 1
    ],
    /* While this
    is
    comment 2 */
    "car": [
        ["and", "another", "here"], /* Last comment */
    ]
}
'''

print(json5.loads(data))
print(pyjson5.loads(data))

Output

$ python3 script.py 
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}
{'something': [['any'], ['thing', 'here', 10]], 'car': [['and', 'another', 'here']]}
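
To tie this back to the question: the parsed result is an ordinary dict, so the field can be updated and written out with the standard json module. A rough sketch, assuming it is acceptable that the comments are dropped from the written file. `update_field` and its parameters are hypothetical names, and the parser is passed in so that json5.loads or pyjson5.loads can be used where installed.

```python
import json
from pathlib import Path


def update_field(src, dst, key, value, loads=json.loads):
    """Parse src with `loads`, set data[key] = value, write plain JSON to dst.

    `loads` defaults to the standard parser; pass json5.loads (assumed to be
    installed) for files that contain comments.
    """
    data = loads(Path(src).read_text(encoding="utf-8"))
    data[key] = value
    # The output is plain JSON, so any comments in src are not carried over.
    Path(dst).write_text(json.dumps(data, indent=4), encoding="utf-8")
    return data
```

With json5 installed, the call would look like update_field(src, dst, "password", password, loads=json5.loads).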

Upvotes: 6
