Using re to capture text between key words over the course of a doc

Question

I am trying to capture the text between key words in a document, along with the key words themselves.

For example, let's say I have multiple instances of "egg" in a string. I want to capture each word between one "egg" and the next "egg".

I have tried:

import re
text = "egg hashbrowns egg bacon egg fried milk egg"
re.findall(r"(/egg) (.*) (/egg)", text)

I have also tried re.match and re.search.

What I usually get is ("egg"), ("hashbrowns egg bacon egg fried milk"), ("egg")

What I need to get is (egg, hashbrowns, egg), (egg, bacon, egg), (egg, fried, milk, egg).

I would appreciate any help on this matter.

aquavitae · Accepted Answer

You need to use a non-greedy match. *? is the non-greedy form of * and matches the shortest possible sequence. Also, /egg matches a literal slash followed by egg, but I assume you just want egg, so your regex becomes (egg) (.*?) (egg). However, since a regular expression consumes the string as it matches, the closing "egg" of one match cannot also be the opening "egg" of the next, so you need look-behind and look-ahead assertions to match the intermediate text. In this case, (?<=egg) (.*?) (?=egg) finds text with "egg" before and after, but returns only the in-between text, i.e. ['hashbrowns', 'bacon', 'fried milk']. Returning "egg" too is more involved, because each "egg" has to close one group and open the next, so it's only worth going into if that's actually what you want.
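To make the difference concrete, here is a small sketch that runs the three patterns discussed above against the string from the question. The variable names (greedy, lazy, between) are only illustrative.

```python
import re

text = "egg hashbrowns egg bacon egg fried milk egg"

# Greedy: .* runs on to the last " egg", so the whole string is one match.
greedy = re.findall(r"(egg) (.*) (egg)", text)

# Non-greedy: .*? stops at the first " egg", but each match consumes its
# closing "egg", so the segment that follows it ("bacon") is skipped.
lazy = re.findall(r"(egg) (.*?) (egg)", text)

# Look-behind / look-ahead: the surrounding "egg"s are checked but not
# consumed, so every in-between segment is found.
between = re.findall(r"(?<=egg) (.*?) (?=egg)", text)

print(greedy)   # [('egg', 'hashbrowns egg bacon egg fried milk', 'egg')]
print(lazy)     # [('egg', 'hashbrowns', 'egg'), ('egg', 'fried milk', 'egg')]
print(between)  # ['hashbrowns', 'bacon', 'fried milk']
```

The lazy result shows why switching to *? alone is not enough: the "egg" after "hashbrowns" has already been consumed by the first match, so it cannot start the second.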

All this is documented in the Python docs for the re module, so look there for more info.
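If the key words really are needed in the output, one possible approach (a sketch, not something the answer above proposes) is to put the closing "egg" in a capturing group inside a look-ahead, so it is reported without being consumed, and then split the in-between text into words. The names triples and flat are hypothetical.

```python
import re

text = "egg hashbrowns egg bacon egg fried milk egg"

# The closing "egg" is captured inside a look-ahead: it shows up in the
# result but is not consumed, so it can also open the next match.
triples = re.findall(r"(egg) (.*?) (?=(egg))", text)

# Split the in-between text on whitespace to get one word per tuple item.
flat = [(start, *middle.split(), end) for start, middle, end in triples]

print(flat)
# [('egg', 'hashbrowns', 'egg'), ('egg', 'bacon', 'egg'), ('egg', 'fried', 'milk', 'egg')]
```

This assumes the key word and the text are separated by single spaces, as in the example string; other separators would need a different pattern.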
