tcpiper

Reputation: 2544

Python 3.3 re: how to capture a variable number of groups?

The regex keeps only the last match of a repeated group when the number of repetitions is unknown (>= 0):

>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b %}")
[('a', 'b')]
>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b c %}")
[('a', 'c')]
>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b c e %}")
[('a', 'e')]

How can I capture all the groups, like this (what I imagine):

>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b %}")
[('a', 'b')]
>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b c %}")
[('a', 'b', 'c')]
>>> re.findall(r"{% url '(\w+)'(?:\s+(\w+))* %}","{% url 'a' b c e %}")
[('a', 'b', 'c', 'e')]

Note: this is a simplified case to make my question easy to understand, so solutions such as s.split() don't work for the more complicated cases.

My real need is this (note that the amount of whitespace is unknown, >= 1):

grab ["'funcname'", 'first'] from "{% url 'funcname'    first   %}"
grab ["'funcname'", 'first', 'second'] from "{% url 'funcname'  first    second %}"
grab ["'funcname'", 'first', 'second','third'] from "{% url 'funcname'  first second     third    %}"

Or more complicated:

grab ["'funcname'", 'first','fir'] from "{% url 'funcname'    first = fir   %}"
grab ["'funcname'", 'first','fir', 'second', 'sec'] from "{% url 'funcname'  first=fir    second   = sec %}"
grab ["'funcname'", 'first','fir', 'second', 'sec', 'third', 'thi'] from "{% url 'funcname'  first =fir    second = sec    third=thi    %}"

Upvotes: 0

Views: 47

Answers (1)

Martijn Pieters

Reputation: 1122222

You put a multiplier around the group:

(?:\s+(\w+))*

but groups do not multiply; each group has a fixed group number, and every repetition is assigned to that same group. Hence you only ever see the last match.
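To see this in isolation, here is a minimal sketch that applies only the repeated part of the pattern to a sample string. The variable names are illustrative and not part of the question:

```python
import re

# Only the repeated, capturing part of the original pattern.
repeated = re.compile(r"(?:\s+(\w+))*")

m = repeated.match(" b c e")

# The whole repetition is matched ...
whole = m.group(0)      # ' b c e'
# ... but group 1 is a single slot that each repetition overwrites.
captured = m.groups()   # ('e',)

print(whole, captured)
```

The overall match spans all three words, yet the single capturing group holds only the final one.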

You'll have to capture all candidates in one group and split afterwards:

[r[:1] + tuple(r[1].split()) 
 for r in re.findall(r"{% url '(\w+)'((?:\s+\w+)*) %}", inputtext)]

Note that the capturing group now wraps the whole (?:\s+\w+)* pattern, so it captures all repetitions as one string.

Demo:

>>> import re
>>> inputtext = "{% url 'a' b c e %}"
>>> [r[:1] + tuple(r[1].split()) 
...  for r in re.findall(r"{% url '(\w+)'((?:\s+\w+)*) %}", inputtext)]
[('a', 'b', 'c', 'e')]
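The pattern above expects exactly one space before %}, so input with extra trailing whitespace would not match. As a hedged sketch, assuming the only change needed is to allow any whitespace before the closing %}, the same idea can be wrapped in a helper (url_tag_args is a hypothetical name, not part of the re module):

```python
import re

# Same approach as above; the only change is \s* before the closing %}
# (an assumption made here to tolerate trailing whitespace).
URL_TAG = re.compile(r"{% url '(\w+)'((?:\s+\w+)*)\s*%}")

def url_tag_args(text):
    """Return one tuple per url tag: (name, arg1, arg2, ...)."""
    return [(name,) + tuple(rest.split())
            for name, rest in URL_TAG.findall(text)]

print(url_tag_args("{% url 'funcname'  first second     third    %}"))
```

Because str.split() with no arguments splits on any run of whitespace, the number of spaces between arguments does not matter.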

Your second form is more complex and requires a second regular expression to split out the matches:

from itertools import chain

[r[:1] + tuple(chain(*re.findall(r'(\w+)\s*=\s*(\w+)', r[1])))
 for r in re.findall(r"{% url '(\w+)'((?:\s+\w+\s*=\s*\w+)*) \s*%}", inputtext)]

Demo:

>>> inputtext = "{% url 'funcname'  first =fir    second = sec    third=thi    %}"
>>> [r[:1] + tuple(chain(*re.findall(r'(\w+)\s*=\s*(\w+)', r[1])))
...  for r in re.findall(r"{% url '(\w+)'((?:\s+\w+\s*=\s*\w+)*) \s*%}", inputtext)]
[('funcname', 'first', 'fir', 'second', 'sec', 'third', 'thi')]
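If both forms can occur, one possible variation is to make the "= value" part optional in the outer pattern and then pull every word out of the captured tail. This is a sketch under that assumption; the combined pattern and the name parse_url_tags are illustrative, not taken from the answer above:

```python
import re

# Outer pattern: each argument is a word, optionally followed by "= word".
# The combined pattern is an assumption for illustration.
TAG = re.compile(r"{% url '(\w+)'((?:\s+\w+(?:\s*=\s*\w+)?)*)\s*%}")

def parse_url_tags(text):
    """Return (name, token, token, ...) for every url tag in text."""
    results = []
    for name, tail in TAG.findall(text):
        # The tail holds only words, whitespace and '=', so \w+ finds
        # the keys and values in their original order.
        results.append((name,) + tuple(re.findall(r"\w+", tail)))
    return results

print(parse_url_tags("{% url 'funcname'  first =fir    second = sec    third=thi    %}"))
print(parse_url_tags("{% url 'funcname'  first    second %}"))
```

The outer pattern does the validation, so the inner re.findall can stay simple: by the time it runs, the tail is known to contain nothing but arguments.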

Upvotes: 1
