Reputation: 3
How would one go about parsing an arbitrary string (which contains all sorts of characters) into something coherent?
For example, string = """{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}"""
I'd like to separate it into:
{"letters" : '321"}{}"'}
and {'{}{{}"': "stack{}}{"}
I've tried iterating through string, counting each open bracket { and subtracting when a close bracket } shows up. However, this doesn't work because there are instances where the brackets are inside "" or ''.
My code was something along the lines of:
bracket_counter = 0
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the strings after building
for c in string:
    list1.append(c)
    if c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []
Using this code, the first string that is considered "complete" is {"letters" : '321"} even though it should be {"letters" : '321"}{}"'}
I'm pretty unfamiliar with regex, so I'm not sure if this is something I should be using it for. Any help is appreciated.
Thanks!
Upvotes: 0
Views: 990
Reputation: 1
This is a Python question, but you could port the code of the JavaScript package dqtokenizer (https://www.npmjs.com/package/dqtokenizer) to do this more easily.
testTokenize(`{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}`, {
    additionalBoundaryChars: [],
    singleCharTokens: ['(', ')', '{', '}', '[', ']', ':'],
});
output:
tokens:
0: {
1: "letters"
2: :
3: '321"}{}"'
4: }
5: {
6: '{}{{}"'
7: :
8: "stack{}}{"
9: }
Upvotes: 0
Reputation:
You would also have to check whether you are inside a quoted string or not. A simple way is to keep another variable holding the opening quote character and skip the rest of the loop body while you are inside a string and the current character is not the closing quote.
bracket_counter = 0
quote = ""  # the quote character we are currently inside, or "" if none
list1 = []  # list1 is where we build up the current string
list2 = []  # list2 is where we keep the strings after building
for c in string:
    list1.append(c)
    if quote:
        if c == quote:  # found the closing quote
            quote = ""
        continue  # ignore brackets while inside a quoted string
    if c in "'\"":  # an opening quote character
        quote = c  # brackets are skipped until this quote is closed
    elif c == "{":
        bracket_counter += 1
    elif c == "}":
        bracket_counter -= 1
        if bracket_counter == 0:
            list2.append("".join(list1))
            list1 = []
If you wanted a regex, your first attempt would look like:
{.*?}
But you want to ignore quotes, so you would do:
{((".*?")|('.*?')|.)*?}
Basically, this takes advantage of lazy quantifiers. It tries to find quoted things as "…", then '…' then finally picks any character.
If you would rather not use lazy quantifiers, use the regex:
{("[^"]*"|'[^']*'|[^{}])*}
This gives the code:
import re

def parse(s):
    return [group[0] for group in re.findall("({((\".*?\")|('.*?')|.)*?})", s)]
Usage:
>>> string = """{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}"""
>>> parse(string)
['{"letters" : \'321"}{}"\'}', '{\'{}{{}"\': "stack{}}{"}']
>>> print(", ".join(parse(string)))
{"letters" : '321"}{}"'}, {'{}{{}"': "stack{}}{"}
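The non-lazy regex mentioned earlier can be used in much the same way. Here is a rough sketch; the name parse_greedy is made up for illustration, and I have escaped the braces and made the group non-capturing so that findall returns whole matches, which is a small variation on the pattern as written above:

```python
import re

# Quoted strings are tried first; otherwise any single character that is
# not a brace is consumed. No lazy quantifiers are involved.
OBJECT_RE = re.compile(r"""\{(?:"[^"]*"|'[^']*'|[^{}])*\}""")

def parse_greedy(s):
    """Return every {...} group found in s (hypothetical helper)."""
    # With no capturing groups, findall returns the full matches.
    return OBJECT_RE.findall(s)

string = """{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}"""
for group in parse_greedy(string):
    print(group)
```

Like the lazy version, this assumes that quotes are balanced and that braces are not nested outside of quotes.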
Upvotes: 0
Reputation: 133909
You'd use a regular expression to tokenize your string, and then you'd iterate over these tokens. For example:
import re

SQ = r"'[^']*'"  # single-quoted string
DQ = r'"[^"]*"'  # double-quoted string
OPS = r'[{}:]'   # operators
WS = r'\s+'      # whitespace
# add more types as needed...
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')'
pattern = re.compile(tokens, re.DOTALL)

def tokenize(source):
    start = 0
    end = len(source)
    while start < end:
        # try to match a token exactly at the current position
        match = pattern.match(source, start)
        if match:
            yield match.group(0)
        else:
            raise ValueError('Invalid syntax at character %d' % start)
        start = match.end()
Then you can run your for loop on these tokens:
for token in tokenize(string):
    ...
For your example input, the tokens are:
>>> for token in tokenize(string):
...     print(repr(token))
...
'{'
'"letters"'
' '
':'
' '
'\'321"}{}"\''
'}'
'{'
'\'{}{{}"\''
':'
' '
'"stack{}}{"'
'}'
And as you can see, from this you can count the '{' and '}' correctly.
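To make the counting step concrete, here is a minimal sketch that splits the input into its top-level groups by counting brace tokens. The helper name split_groups and its handling of whitespace between groups are my own assumptions, not something the question specifies; the tokenizer is repeated so the snippet runs on its own:

```python
import re

SQ = r"'[^']*'"  # single-quoted string
DQ = r'"[^"]*"'  # double-quoted string
OPS = r'[{}:]'   # operators
WS = r'\s+'      # whitespace
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

def split_groups(source):
    """Yield each balanced top-level {...} group (hypothetical helper)."""
    depth = 0
    current = []
    for token in tokenize(source):
        if depth == 0 and token != '{':
            if token.isspace():
                continue  # assumption: whitespace between groups is ignored
            raise ValueError('Unexpected token outside braces: %r' % token)
        current.append(token)
        # Quoted tokens include their quotes, so they never equal '{' or '}'.
        if token == '{':
            depth += 1
        elif token == '}':
            depth -= 1
            if depth == 0:
                yield ''.join(current)
                current = []
    if depth != 0:
        raise ValueError('Unbalanced braces')

string = """{"letters" : '321"}{}"'}{'{}{{}"': "stack{}}{"}"""
for group in split_groups(string):
    print(group)
```

Because a quoted string arrives as a single token, any braces inside it never touch the depth counter.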
Notice that the regular expression above has no notion of escaping the ' or " inside the strings; if you want \ to escape the closing quote and still have the string tokenized properly, you can change the SQ and DQ regexes into
SQ = r"'(?:[^\\']|\\.)*'"
DQ = r'"(?:[^\\"]|\\.)*"'
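As a quick illustration of what the escape-aware patterns change, here is a small sketch. The sample input is invented for the demonstration, and the tokenizer is repeated so that the snippet is self-contained:

```python
import re

SQ = r"'(?:[^\\']|\\.)*'"  # single-quoted string, backslash escapes allowed
DQ = r'"(?:[^\\"]|\\.)*"'  # double-quoted string, backslash escapes allowed
OPS = r'[{}:]'             # operators
WS = r'\s+'                # whitespace
pattern = re.compile('(?:' + '|'.join([OPS, SQ, DQ, WS]) + ')', re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# Made-up input: both strings contain backslash-escaped quotes,
# and the first one also contains a brace.
sample = r'''{"say \"hi\"}" : 'it\'s'}'''
for token in tokenize(sample):
    print(repr(token))
```

Each quoted string should come out as one token, with the escaped quotes and the brace kept inside it.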
Also, if you want any other characters to be allowed but not handled specially, you can define
NON_SPECIAL = r'[^\'"]'
and add it as the last branch of the regex:
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
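A short sketch of the effect, using an invented sample that contains characters none of the other branches accept (digits, a comma, a bare word):

```python
import re

SQ = r"'[^']*'"          # single-quoted string
DQ = r'"[^"]*"'          # double-quoted string
OPS = r'[{}:]'           # operators
WS = r'\s+'              # whitespace
NON_SPECIAL = r'[^\'"]'  # any single character that is not a quote
tokens = '(?:' + '|'.join([OPS, SQ, DQ, WS, NON_SPECIAL]) + ')'
pattern = re.compile(tokens, re.DOTALL)

def tokenize(source):
    start = 0
    while start < len(source):
        match = pattern.match(source, start)
        if not match:
            raise ValueError('Invalid syntax at character %d' % start)
        yield match.group(0)
        start = match.end()

# Made-up input: '12', ',' and 'x' only match the NON_SPECIAL branch,
# so each of those characters becomes its own one-character token.
sample = '{"a" : 12, "b" : x}'
print(list(tokenize(sample)))
```

Since NON_SPECIAL excludes both quote characters, an unterminated quoted string still raises ValueError instead of being silently accepted.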
Upvotes: 2