Reputation: 49
Basically what I am trying to accomplish is whenever a number precedes a string within square brackets, the string within the square brackets should repeat as many times. I am learning regular expression and I posted a question how I can come up with a regular expression for this scenario and folks were kind enough to get me what I wanted. Basically the code below parses the string based on the regular expression and get me a list with all square brackets removed so that I can iterate through the list and get myself the output I want.
Given a string s, the code I have is below. But this does not work in all scenarios. This however works when there are no nested square brackets.
import re
s = "abc3[cd]xyz"
s_final = ""
res = re.findall(r'\d+|[^\W\d_]+', s)
print(res)
for i in range(len(res)):
if res[i].isnumeric():
s_final+=int(res[i])*res[i+1]
else:
if res[i] not in s_final:
s_final += res[i]
print(s_final)
if I change the string to
s = "3[a2[c]]" the expected output is accaccacc as the inner square brackets gets evaluated first and then the outer square brackets. But what I get is aaacc which is not wrong based on the code I have. But anyone have any insights on how I can evaluate the string correctly?
Thanks!
Upvotes: 1
Views: 291
Reputation: 63749
Regardless of what parsing tools are used, I always recommend that developers start by writing a BNF (Backus-Naur Form) for their parser, as a roadmap for the development to follow. Here is what I've gathered from your problem description (where '+' means "one or more"):
string_expr := (letter | repetition)+
repetition := integer '[' string_expr ']'
letter := character a-z
integer := (character 0-9)+
At this point, walk through the BNF with your test samples, and make sure things look correct.
You can see that this is a recursive grammar, since string_expr
uses repetition
in its definition, but repetition
uses string_expr
.
Pyparsing is written to make it easy to map these BNF definitions to pyparsing objects. Follow this annotated code: (need to work bottom-up in implementing your grammar)
import pyparsing as pp
# define []'s, but we'll use pyparsing's Suppress since they won't be
# needed in the actual expression after parsing is done
LBRACK, RBRACK = map(pp.Suppress, "[]")
# a letter will be a single character from the list of lowercase letters
# defined in the Python string module
letter = pp.Char(string.ascii_lowercase)
# an integer will be one or more numeric digits - attach a parse
# action to convert parsed numeric strings into actual Python ints
integer = pp.Word("0123456789").addParseAction(lambda t: int(t[0]))
# we would like to write 'repetition' now, but we haven't defined
# 'string_expr' yet. We can't define that yet because 'repetition' isn't
# defined yet. So let's define 'string_expr' using a Pyparsing
# placeholder - a Forward
string_expr = pp.Forward()
repetition = (integer("multiplier")
+ LBRACK
+ string_expr("content")
+ RBRACK)
# now that repetition is defined, we can "insert" the definition of
# 'string_expr' into the placeholder we created earlier
string_expr <<= (letter | repetition)[...]
# two more parse actions needed:
# - one to do the multiplication of multiplier*content in repetition
# - one to join all the pieces together in string_expr
repetition.addParseAction(lambda t: t.multiplier * list(t.content))
string_expr.addParseAction("".join)
Now test it out using the runTests() method pyparsing defines on parser elements:
string_expr.runTests("""\
abc3[cd]xyz
3[a2[c]]
3[a2[cb]]
""")
Gives:
abc3[cd]xyz
['abccdcdcdxyz']
3[a2[c]]
['accaccacc']
3[a2[cb]]
['acbcbacbcbacbcb']
Upvotes: 1
Reputation: 71461
You can use recursion to build the full string, after first parsing the characters:
import re
s = "3[a2[c]]"
def build_string(d):
while (n:=next(d, None)) is not None and n != ']':
yield n if not (j:=re.findall('\d+$', n)) else n[:-1*len(j[0])]
if j:
_ = next(d)
yield ''.join(build_string(d))*int(j[0])
def get_result(s):
return ''.join(build_string(iter(re.findall('\w+|\[|\]', s))))
print(get_result(s))
Output:
'accaccacc'
Upvotes: 1
Reputation: 627082
You can use
import re
s = "3[a2[c]]"
s_final = s
rx = r'(\d+)\[([^][]*)]'
n = 1
while n:
s_final, n = re.subn(rx, lambda x: x.group(2) * int(x.group(1)), s_final)
print(s_final) # => accaccacc
See the Python demo.
The (\d+)\[([^][]*)]
regex matches and captures into Group 1 any one or more digits, and then matches a [
char, captures into Group 2 any text other than [
and ]
chars up to the closest ]
char.
The n = 1
is a flag that makes re.subn
run at least once.
Then, the while
block is executed to find all the occurrences of the pattern and replace with Group 1 value repeated Group 2 times.
Upvotes: 0