turtle

Reputation: 8093

match non-greedy regex delimiters

I have text where some of the text is delimited by:

{# xxx #} some text {# zzz #}

I have many occurrences of this pattern throughout my text. I'd like to extract the text between the delimiters (the "some text" part). How can I do this with a regex?

For example if I have this text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled {# xxx #} it to make {# zzz #} a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s {# xxx #} with {# zzz #} the release of Letraset sheets containing Lorem Ipsum passages, and more recently with {# xxx #} desktop publishing software like Aldus PageMaker {# zzz #} including versions of Lorem Ipsum.

I'd like to get a list like:

[it to make, with, desktop publishing software like Aldus PageMaker]

Here's my non-working code:

>>> regex = re.compile(r'{# xxx #}.*({# zzz #}).*?')

>>> re.findall(regex, s)
{# zzz #}

I think my difficulty is in crafting the regex in a non-greedy manner?

Upvotes: 0

Views: 647

Answers (1)

user1919238


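You can get non-greedy behavior by adding ? after the quantifier between the delimiters, i.e. .*? instead of .*. Also, you should not have .*? at the end: a lazy quantifier at the very end of a pattern always matches the empty string, so it doesn't do anything. { and } are special characters in regex syntax (they delimit repetition counts such as {2,3}) and should probably be escaped. Finally, the parentheses need to be around the part you want to capture. That gives you this pattern: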

 regex = re.compile(r'\{# xxx #\}(.*?)\{# zzz #\}')
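To see what the ? changes, here is a quick sketch comparing the greedy and non-greedy versions. The short sample string and the variable names are made up for illustration, not taken from the question:

```python
import re

# Made-up sample with two delimited spans
sample = "a {# xxx #} one {# zzz #} b {# xxx #} two {# zzz #} c"

# Greedy: .* runs from the first opening delimiter to the LAST closing one
greedy = re.findall(r'\{# xxx #\}(.*)\{# zzz #\}', sample)

# Non-greedy: .*? stops at the nearest closing delimiter
lazy = re.findall(r'\{# xxx #\}(.*?)\{# zzz #\}', sample)

print(greedy)  # one long match that swallows the inner delimiters
print(lazy)    # one entry per delimited span
```

The greedy version returns a single item that spans both pairs of delimiters, while the non-greedy version returns each span separately.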

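To use it, call re.findall on your text. Because the pattern contains exactly one capturing group, findall returns a list with the contents of that group for every match, so no explicit loop is needed. If you want match objects instead, iterate over re.finditer and call m.group(1) on each one. Don't use re.match here: it only matches at the start of the string. Note that the captured text will include the spaces next to the delimiters, so either strip each result or put the spaces into the pattern.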
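Putting it together, here is a minimal sketch on an excerpt of the question's text. The variable names and the \s* trimming around the group are my additions, not something the question asked for:

```python
import re

# Excerpt of the text from the question
s = (
    "took a galley of type and scrambled {# xxx #} it to make {# zzz #} "
    "a type specimen book. It was popularised in the 1960s {# xxx #} with "
    "{# zzz #} the release of Letraset sheets, and more recently with "
    "{# xxx #} desktop publishing software like Aldus PageMaker {# zzz #} "
    "including versions of Lorem Ipsum."
)

# \s* outside the group keeps the surrounding spaces out of the capture
regex = re.compile(r'\{# xxx #\}\s*(.*?)\s*\{# zzz #\}')

# With a single capturing group, findall returns the group's text per match
found = regex.findall(s)

# Equivalent loop over match objects, if you also need positions etc.
found_iter = [m.group(1) for m in regex.finditer(s)]

print(found)
```

If a delimited span can run across line breaks, you would probably also want to pass re.DOTALL to re.compile, since . does not match a newline by default.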

Upvotes: 2
