Reputation: 3461
I have a JavaScript file containing many object literals:
// lots of irrelevant code
oneParticularFunction({
key1: "string value",
key2: 12345,
key3: "strings which may contain ({ arbitrary characters })"
});
// more irrelevant code
I need to write some Python code to extract these literals.
My first attempt was this regular expression:
oneParticularFunction\(\{(.*?)\}\);
But this fails if the literal contains a "})".
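To illustrate that failure, here is a minimal sketch. It assumes the pattern is compiled with re.DOTALL so that . can cross line breaks, and the input string is made up for the demonstration. Because the pattern ends in \);, the non-greedy match is cut short as soon as a string value contains "});".

```python
import re

# The naive pattern from above; re.DOTALL (an assumption) lets "." span newlines.
NAIVE = re.compile(r'oneParticularFunction\(\{(.*?)\}\);', re.DOTALL)

# Hypothetical input: key1 contains the same characters that end the call.
source = '''oneParticularFunction({
key1: "looks like the end }); but is not",
key2: 12345
});'''

captured = NAIVE.search(source).group(1)
print(repr(captured))  # '\nkey1: "looks like the end '
```

The capture stops inside key1, so key2 is never seen.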
Since I know the objects will be valid JSON (matched quotes, braces, etc.) in a valid JavaScript file, is there a more elegant way to extract them?
(In other words, the difficulty is removing all the other JavaScript code I don't care about.)
EDIT: In the end, I used a regular expression for any objects which don't contain sub-objects...
oneParticularFunction\((\{([^"}]*"[^"]*"[^"}]*)*?[^"]*?\})\);
...and tracked open/close braces by hand for anything with nesting.
Upvotes: 1
Views: 948
Reputation: 6088
Regex code:
(?<=(?:\s\"))[\s\S]+?(?=\")|(?<=(?:\s))\d+
Live example of regex at https://regex101.com/r/bfNkvF/3
To use the previous regex in Python:
import re
text = '''oneParticularFunction({
key1: "string value",
key2: 12345,
key3: "strings which may contain ({ arbitrary characters })"
});'''
# Print every string or integer value that follows a "key: " prefix.
for m in re.finditer(r"(?<=(:\s\"))[\s\S]+?(?=\")|(?<=(:\s))\d+", text):
    print(m.group(0))
I tested this code on pythontutor, and it seems to work. You can copy it and paste it there. Let me know if it works on the other object literals.
Upvotes: 2
Reputation: 1972
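I was able to use this to remove all inner braces from a string without eliminating or mismatching the outer '({' and '})':
import re
# mystring holds the text to process
while True:
    # Strip one innermost {...} pair per pass, keeping its contents.
    newstring = re.sub(r'(\(\{.*)\{([^{}]*)\}(.*\}\))', r'\1\2\3', mystring)
    if newstring == mystring:
        break  # nothing left to strip
    mystring = newstring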
There are 3 groups here (I know, it's hard to tell). The first, (\(\{.*), finds your ({ and then whatever comes after it, up until it finds the innermost {. We know it is the innermost { because of the second group, ([^{}]*), which will match anything that is not a { or }. Then, (.*\}\)) finds everything after the innermost }.
This whole match is replaced by combining these three groups back together (with the {}'s left out). It repeats this until it finds no more matching braces to replace.
If you wanted to also replace ()'s, you could modify it to:
newstring = re.sub(r'(\(\{.*)(\{|\()([^{}()]*)(\}|\))(.*\}\))', r'\1\3\5', mystring)
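As a worked example of the brace-only loop, here is a small sketch that wraps it in a function and runs it on a made-up one-line input. The names strip_inner_braces and INNER_BRACES are hypothetical, and re.DOTALL is an added assumption so the pattern would also work on multi-line text. Note that braces inside string values are stripped too, since the pattern does not know about quotes.

```python
import re

# Same pattern as the loop above; re.DOTALL is an added assumption.
INNER_BRACES = re.compile(r'(\(\{.*)\{([^{}]*)\}(.*\}\))', re.DOTALL)

def strip_inner_braces(text):
    """Remove innermost {...} pairs one pass at a time, keeping the outer ({ and })."""
    while True:
        new_text = INNER_BRACES.sub(r'\1\2\3', text)
        if new_text == text:
            return text
        text = new_text

flattened = strip_inner_braces('f({a: {b: {c: 1}}, d: 2})')
print(flattened)  # f({a: b: c: 1, d: 2})
```

Each pass removes one pair, starting with {c: 1} and then {b: c: 1}, until only the outer braces remain.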
Upvotes: 1
Reputation: 259
Why not write a small state machine that starts at the opening {, increments a counter on every { and decrements it on every }? When the counter reaches 0 again, take all the characters in between and use Python's json parser to check whether they are valid. That way you get the benefit of real syntax errors instead of the simple match/no match a regex gives you. Braces inside string literals (as in key3) must not be counted, so the scanner also has to track whether it is inside a quoted string.
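A minimal sketch of this counter idea, under some assumptions: the function name extract_literals and its parameters are hypothetical, the call is assumed to look like oneParticularFunction({ ... }), and only quoted strings with backslash escapes are skipped (JavaScript comments, template literals and regex literals are not handled). Also note that json.loads only accepts strict JSON, so a literal with unquoted keys like the one in the question would be extracted correctly but rejected by the parser.

```python
import json
import re

def extract_literals(source, func_name="oneParticularFunction"):
    """Return the text of each {...} literal passed directly to func_name(...)."""
    literals = []
    # Find "func_name(" immediately followed by an opening brace.
    opener = re.compile(re.escape(func_name) + r'\(\s*(?=\{)')
    for match in opener.finditer(source):
        start = match.end()   # index of the opening "{"
        depth = 0             # current brace nesting level
        quote = None          # quote character we are inside, if any
        escaped = False       # True right after a backslash inside a string
        for i in range(start, len(source)):
            ch = source[i]
            if quote:
                if escaped:
                    escaped = False
                elif ch == '\\':
                    escaped = True
                elif ch == quote:
                    quote = None
            elif ch in '"\'':
                quote = ch
            elif ch == '{':
                depth += 1
            elif ch == '}':
                depth -= 1
                if depth == 0:
                    literals.append(source[start:i + 1])
                    break
    return literals

# Made-up input using strict JSON so that json.loads can validate it.
source = 'before(); oneParticularFunction({"key": {"nested": "has }) inside"}}); after();'
for literal in extract_literals(source):
    # json.loads raises json.JSONDecodeError if the literal is not valid JSON.
    print(json.loads(literal))
```

Keeping extraction and validation separate means a literal that fails json.loads can still be inspected or handed to a more lenient parser.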
Upvotes: 2