Reputation: 8093
I have text where some of the text is delimited by:
{# xxx #} some text {# zzz #}
I have many occurrences of this pattern throughout my text. I'd like to extract the text
between the delimiters. How can I do this with a regex?
For example, if I have this text:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled {# xxx #} it to make {# zzz #} a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s {# xxx #} with {# zzz #} the release of Letraset sheets containing Lorem Ipsum passages, and more recently with {# xxx #} desktop publishing software like Aldus PageMaker {# zzz #} including versions of Lorem Ipsum.
I'd like to get a list like:
[it to make, with, desktop publishing software like Aldus PageMaker]
Here's my non-working code:
>>> regex = re.compile(r'{# xxx #}.*({# zzz #}).*?')
>>> re.findall(regex, s)
['{# zzz #}']
I think my difficulty is in making the regex non-greedy?
Upvotes: 0
Views: 647
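You can get non-greedy behavior simply by adding ? after the .* in between the delimiters. Also, you should not have .*? at the end: a lazy quantifier at the end of a pattern always matches the empty string, so it doesn't do anything. { and } are special characters in regex syntax (they delimit repetition counts such as {2,3}) and should probably be escaped. Finally, the parentheses need to be around the part you want to extract. That gives you this pattern: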
regex = re.compile(r'\{# xxx #\}(.*?)\{# zzz #\}')
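To use it, call regex.findall(s). Because the pattern contains exactly one capturing group, findall returns a list with that group's text for every match, not just the last one. If you want match objects instead, loop over regex.finditer(s) and use m.group(1) to get the first subgroup (the part in parentheses). Don't use re.match for this, because it only matches at the start of the string. The captured text includes the spaces next to the delimiters, so call .strip() on each result if you don't want them.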
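As a minimal sketch of applying this pattern, the snippet below loops over the matches with finditer and reads group(1); it also shows that findall, given a pattern with a single capturing group, returns the same list. The sample string s is a shortened stand-in for the text in the question (an assumption), and the variable names looped and found are only for illustration.

```python
import re

# Shortened sample text (an assumption; any string using the same delimiters works).
s = ("scrambled {# xxx #} it to make {# zzz #} a type specimen book. "
     "popularised in the 1960s {# xxx #} with {# zzz #} the release, "
     "and more recently with {# xxx #} desktop publishing software "
     "like Aldus PageMaker {# zzz #} including versions.")

# Braces escaped, lazy quantifier, capture group around the wanted text.
regex = re.compile(r'\{# xxx #\}(.*?)\{# zzz #\}')

# Loop over match objects and read the first subgroup,
# stripping the spaces that sit next to the delimiters.
looped = [m.group(1).strip() for m in regex.finditer(s)]

# With exactly one capturing group, findall returns that group's
# text for every match.
found = [text.strip() for text in regex.findall(s)]

print(looped)
# ['it to make', 'with', 'desktop publishing software like Aldus PageMaker']
```

If the delimited text can span several lines, passing re.DOTALL to re.compile lets . match newlines as well.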
Upvotes: 2