How to remove certain comments from a json file? (/*)

Question

I have about 500 JSON files with comments in them. Trying to update a field in one of these files with a new value throws an error. I managed to use commentjson to remove comments like this // some text, after which the JSON file updates without errors.

But there are about 100 JSON files with comments like this:

  /*

   1. sometext.
        i. sometext
        ii. sometext 
   2. sometext

  */

Commentjson just crashes when /* exists. If I remove the /* ... */ block by hand and run the code, it works: the field is updated and any // comments are removed. How can I write some code to handle /* and all the text between /* and */?

This is my current code that can remove //

with open(f"{i['Location']}\\{file_name}", 'r') as f:
    json_info = commentjson.load(f)  # parse the file, ignoring // comments
    json_info['password'] = password

    with open(f"{i['location_Daily']}\\{file_name}", 'w') as out:
        commentjson.dump(json_info, out, indent=4)  # write the updated password
        print("updated")

Martijn Pieters · Accepted Answer

You have a few options:

Read the whole file into a string, then use a regular expression to pre-process the text. E.g.:

import re
import commentjson

with open(...) as f:
    json_text = f.read()
# Remove everything from '/*' to '*/', where each character in between is either
# - a '*' character that is *not* followed by '/'
# - any character that is not '*'
without_comments = re.sub(r"/\*(?:\*(?!/)|[^*])*\*/", "", json_text)
json_info = commentjson.loads(without_comments)

Note that this approach is not going to work if there are also JSON strings with the /* and */ inside of them. A regex is not a JSON parser.
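If your files might contain /* or // inside string values (URLs are the usual culprit), a small hand-written scanner that tracks whether it is inside a JSON string avoids that failure mode. The following is only a minimal sketch, not part of commentjson: strip_json_comments is a hypothetical helper name, it handles only // and /* */ comments (not #), and it assumes strings are double-quoted with backslash escapes, as in standard JSON.

```python
import json


def strip_json_comments(text: str) -> str:
    """Remove // and /* ... */ comments that appear outside JSON strings."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character too, so \" does not end the string.
                out.append(text[i + 1])
                i += 2
                continue
            if ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            # An unterminated block comment swallows the rest of the input.
            i = n if end == -1 else end + 2
        elif text.startswith("//", i):
            end = text.find("\n", i)
            # Stop before the newline so it is kept in the output.
            i = n if end == -1 else end
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = (
    '{\n'
    '  /* 1. sometext.\n'
    '     2. sometext */\n'
    '  "url": "http://example.com/*not-a-comment*/", // trailing note\n'
    '  "password": "old"\n'
    '}'
)
json_info = json.loads(strip_json_comments(sample))
json_info["password"] = "new"
```

Because the output contains no comments at all, the standard json module can parse it; passing it to commentjson.loads instead should work as well.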

Try to update the parser that the commentjson project uses to parse JSON. Looking at the project source code, they use the Lark parsing library, so you could monkey patch the module with additional grammar.

I note that the main branch already has a grammar rule defining multi-line comments:

COMMENT: "/*" /(.|\n)+?/ "*/"
       | /(#|\/\/)[^\n]*/

but that is not yet part of their release. You can, however, re-use that rule:

from commentjson import commentjson as implementation
from lark.reconstruct import Reconstructor

serialized = implementation.parser.serialize()
for tok in serialized["parser"]["lexer_conf"]["tokens"]:
    if tok["name"] != "COMMENT":
        continue
    if tok["pattern"]["value"].startswith("(#|"):
        # only supports `#` or `//` comments, add block comments
        tok["pattern"]["value"] = r'(?:/\*(?:\*(?!/)|[^*])*\*/|(#|//)[^\n]*)'
    break

implementation.parser = implementation.parser.deserialize(serialized, None, None)

I used my own regex in that grammar update rather than the version used by the project.
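To illustrate one way the two block-comment patterns can differ, here is a small comparison using plain re. Treat it as a sketch under an assumption: project_style is my own transliteration of the grammar rule "/*" /(.|\n)+?/ "*/" into a single Python regex, and Lark may not compile the rule to exactly this pattern.

```python
import re

# Assumption: hand-written equivalent of the Lark rule, not taken from commentjson.
project_style = re.compile(r"/\*(.|\n)+?\*/")
# The block-comment part of the pattern used in the patch above.
answer_style = re.compile(r"/\*(?:\*(?!/)|[^*])*\*/")

text = '/**/ {"a": 1} /* note */'

# `+?` needs at least one character after "/*", so the empty comment "/**/"
# cannot close immediately and the match runs on to the last "*/".
project_result = project_style.sub("", text)

# `*` allows zero characters, so "/**/" and "/* note */" match separately.
answer_result = answer_style.sub("", text)

print(repr(project_result))
print(repr(answer_result))
```

Empty block comments are rare in practice, so this only matters for edge cases.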

Find a different library to parse the input. There are several options that claim to support parsing JSON with the same syntax:
I have not tried any of these, nor do I have anything to say about their usability or performance.
