Draconis

Reputation: 3461

Matching JSON with a regular expression

I have a JavaScript file containing many object literals:

// lots of irrelevant code
oneParticularFunction({
    key1: "string value",
    key2: 12345,
    key3: "strings which may contain ({ arbitrary characters })"
});
// more irrelevant code

I need to write some Python code to extract these literals.

My first attempt was this regular expression:

oneParticularFunction\(\{(.*?)\}\);

But it fails if the literal contains a "})".

Since I know the objects will be valid JSON (matched quotes, braces, etc) in a valid JavaScript file, is there a more elegant way to extract them?

(In other words, the difficulty is removing all the other JavaScript code I don't care about.)

EDIT: In the end, I used a regular expression for any objects which don't contain sub-objects...

oneParticularFunction\((\{([^"}]*"[^"]*"[^"}]*)*?[^"]*?\})\);

...and tracked open/close braces by hand for anything with nesting.
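
For reference, a minimal sketch of how the flat-object pattern above might be applied from Python. The names FLAT_CALL and source are illustrative only, and the sample text is the snippet from the top of the question:

```python
import re

# The flat-object pattern quoted above, compiled once.
# Group 1 captures the object literal including its outer braces.
FLAT_CALL = re.compile(
    r'oneParticularFunction\((\{([^"}]*"[^"]*"[^"}]*)*?[^"]*?\})\);'
)

# Hypothetical input: the example from the question.
source = '''// lots of irrelevant code
oneParticularFunction({
    key1: "string value",
    key2: 12345,
    key3: "strings which may contain ({ arbitrary characters })"
});
// more irrelevant code'''

# Collect the text of every literal passed to oneParticularFunction.
literals = [m.group(1) for m in FLAT_CALL.finditer(source)]

for literal in literals:
    print(literal)
```

The captured text still has unquoted keys, so a strict JSON parser such as json.loads would reject it until the keys are quoted.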

Upvotes: 1

Views: 948

Answers (3)

Ibrahim

Reputation: 6088

Regex code:

(?<=(?:\s\"))[\s\S]+?(?=\")|(?<=(?:\s))\d+

Live example of regex at https://regex101.com/r/bfNkvF/3

To use the previous regex in Python:

import re

text = '''oneParticularFunction({
key1: "string value",
key2: 12345,
key3: "strings which may contain ({ arbitrary characters })"
});'''

# First alternative: the contents of a quoted string that follows ': "'.
# Second alternative: a run of digits that follows ': '.
for m in re.finditer(r"(?<=(:\s\"))[\s\S]+?(?=\")|(?<=(:\s))\d+", text):
    print(m.group(0))

I tested this code on pythontutor, and it seems to work. You can copy it and paste it there. Let me know if it works on the other object literals.

Upvotes: 2

TallChuck

Reputation: 1972

I was able to use this to remove all inner braces from a string without removing or mismatching the outer '({' and '})':

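import re

# mystring holds the text to process.
# Each pass strips one innermost {...} pair; stop when nothing changes.
while True:
    newstring = re.sub(r'(\(\{.*)\{([^{}]*)\}(.*\}\))', r'\1\2\3', mystring)
    if newstring == mystring:
        break
    mystring = newstring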

There are three groups here (I know, it's hard to tell). The first is (\(\{.*). It matches your ({ and everything after it, up to the { of an innermost pair of braces.

We know that { is an innermost one because of the second group, ([^{}]*), which matches only characters that are not { or }.

Then (.*\}\)) matches everything after the matching }, through the closing }).

The whole match is replaced by the three groups joined back together, with the inner { and } left out. The loop repeats until there are no more inner braces to replace.

If you also want to remove parentheses, you could modify it to

newstring = re.sub(r'(\(\{.*)(\{|\()([^{}()]*)(\}|\))(.*\}\))', r'\1\3\5', mystring)
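
As a quick illustration, here is the brace-only loop wrapped in a function and run on a made-up one-line input. The names strip_inner_braces and INNER_BRACES are hypothetical, chosen only for this sketch:

```python
import re

# Same brace-only pattern as in the loop above, compiled once.
INNER_BRACES = re.compile(r'(\(\{.*)\{([^{}]*)\}(.*\}\))')

def strip_inner_braces(mystring):
    """Repeatedly drop one innermost {...} pair until none is left."""
    while True:
        newstring = INNER_BRACES.sub(r'\1\2\3', mystring)
        if newstring == mystring:
            return mystring
        mystring = newstring

# Made-up sample with two levels of nesting and a sibling object.
print(strip_inner_braces('f({a: {b: {c: 1}}, d: {e: 2}});'))
# prints: f({a: b: c: 1, d: e: 2});
```

Two things to keep in mind: . does not match newlines unless re.DOTALL is passed, so a literal spread over several lines is left untouched by this pattern as written, and braces inside string values are stripped just like any other inner braces.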

Upvotes: 1

dagonza

Reputation: 259

Why not write a small state machine? Start at the first {, increment a counter on every { and decrement it on every }. When the counter returns to 0, take all the characters in between and run them through Python's json parser to check whether they are valid. That way you get proper syntax errors instead of a simple match/no match from the regex (remember Python is { free, so false positives are impossible).
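
A rough sketch of that counter in Python, for illustration only. The function name extract_object_literals and its marker parameter are made up here. It also assumes double-quoted strings with backslash escapes and skips over them, since a bare counter would be thrown off by braces inside string values such as key3 in the question. Comments, single-quoted strings and regex literals are not handled.

```python
import json

def extract_object_literals(source, marker="oneParticularFunction("):
    """Return the text of each {...} literal that directly follows marker."""
    literals = []
    pos = source.find(marker)
    while pos != -1:
        start = pos + len(marker)
        end = start
        if source.startswith("{", start):
            depth = 0          # number of currently open braces
            in_string = False  # inside a double-quoted string?
            escaped = False    # previous character was a backslash?
            for i in range(start, len(source)):
                ch = source[i]
                if in_string:
                    if escaped:
                        escaped = False
                    elif ch == "\\":
                        escaped = True
                    elif ch == '"':
                        in_string = False
                elif ch == '"':
                    in_string = True
                elif ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        # Counter is back at zero: the literal is complete.
                        literals.append(source[start:i + 1])
                        end = i + 1
                        break
        # Continue searching after this call (or after the marker if the
        # literal was never closed).
        pos = source.find(marker, end)
    return literals

# Hypothetical input with quoted keys so that json.loads accepts it.
sample = '''// lots of irrelevant code
oneParticularFunction({
    "key1": "string value",
    "key2": {"nested": 12345},
    "key3": "strings which may contain ({ arbitrary characters })"
});
// more irrelevant code'''

for literal in extract_object_literals(sample):
    print(json.loads(literal))
```

json.loads accepts the sample above because its keys are quoted. Literals with bare keys, like key1: in the question, raise json.JSONDecodeError and would need their keys quoted first.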

Upvotes: 2
