fearless_fool

Reputation: 35169

Limiting captured range in RegEx expression

(Note: the following uses JavaScript-flavored regexes, in which . does not match a newline, but [^] does.)

Imagine I have this text:

chaff more chaff START PATTERN more chaff chaff more chaff START PATTERN juicy stuff juicy stuff juicy stuff END PATTERN chaff chaff START PATTERN more juicy stuff more juicy stuff END PATTERN

... and I want a RegEx with a global flag (g) that captures the juicy stuff. Specifically, I want the first match to be

START PATTERN juicy stuff juicy stuff juicy stuff END PATTERN

and the second match to be

START PATTERN more juicy stuff more juicy stuff END PATTERN

The fly in the ointment is that first START PATTERN. I've spent some time in regex101.com (an awesome tool for those that don't know it), and this one does not work:

/(?:START PATTERN[^]+)?(START PATTERN[^]+END PATTERN)/?

It captures the second group ("more juicy stuff") but not the first. I've also tried various combinations of negative lookahead, but without success.

Ideas?

Upvotes: 3

Views: 59

Answers (1)

Wiktor Stribiżew

Reputation: 626853

You need a tempered greedy token:

START PATTERN(?:(?!(?:START|END) PATTERN)[^])*END PATTERN
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo.

The (?:(?!(?:START|END) PATTERN)[^])* is called a tempered greedy token because the greedy * quantifier is tempered with a negative lookahead: each [^] consumes one character, but only if neither delimiter starts at that position. Inside the lookahead, list every pattern that must not appear between the leading and the trailing delimiter.
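As a quick check, here is a minimal sketch that runs the pattern with the g flag against the sample text from the question. The variable names (text, tempered, matches) are only for illustration, and the sample is kept on one line as it appears above.

```javascript
// Sample text from the question, kept on a single line.
const text =
  "chaff more chaff START PATTERN more chaff chaff more chaff " +
  "START PATTERN juicy stuff juicy stuff juicy stuff END PATTERN " +
  "chaff chaff START PATTERN more juicy stuff more juicy stuff END PATTERN";

// Tempered greedy token: [^] consumes one character at a time, but only
// while neither "START PATTERN" nor "END PATTERN" begins at that position.
const tempered = /START PATTERN(?:(?!(?:START|END) PATTERN)[^])*END PATTERN/g;

// String.prototype.match with a global regex returns all full matches,
// or null when there are none.
const matches = text.match(tempered) || [];

console.log(matches.length); // 2
console.log(matches[0]); // START PATTERN juicy stuff juicy stuff juicy stuff END PATTERN
console.log(matches[1]); // START PATTERN more juicy stuff more juicy stuff END PATTERN
```

The first START PATTERN is skipped because the token refuses to step over the second START PATTERN, so the attempt from that position can never reach an END PATTERN.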

You can make the match more precise by adding word boundaries if START and END should only match as whole words:

\bSTART PATTERN\b(?:(?!\b(?:START|END) PATTERN)[^])*\bEND PATTERN
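To see what the word boundaries change, here is a small sketch comparing the two patterns on a made-up input. The string "RESTART PATTERN not wanted END PATTERN" is a hypothetical example, not part of the question; it contains START only as the tail of a longer word.

```javascript
// Pattern without word boundaries (from the first snippet in this answer).
const loose = /START PATTERN(?:(?!(?:START|END) PATTERN)[^])*END PATTERN/g;

// Same pattern with \b anchors on the delimiters.
const strict = /\bSTART PATTERN\b(?:(?!\b(?:START|END) PATTERN)[^])*\bEND PATTERN/g;

// Hypothetical input: "RESTART" ends with "START" but is a different word.
const sample = "RESTART PATTERN not wanted END PATTERN";

const looseMatches = sample.match(loose) || [];
const strictMatches = sample.match(strict) || [];

console.log(looseMatches);  // [ 'START PATTERN not wanted END PATTERN' ]
console.log(strictMatches); // []
```

Without \b the match starts in the middle of RESTART; with \b there is no word boundary between the E and the S, so nothing matches.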

To make it more efficient, we can unroll the tempered token so that the lookahead runs only at an S or an E instead of at every character:

START PATTERN[^ES]*(?:S(?!TART PATTERN)[^ES]*|E(?!ND PATTERN)[^ES]*)*END PATTERN

See another demo
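As a sanity check, this sketch runs both the tempered and the unrolled pattern over the question's sample text and compares the results. It only shows that they agree on this one input; it is not a proof of equivalence or a benchmark, and the variable names are illustrative.

```javascript
// Sample text from the question, kept on a single line.
const text =
  "chaff more chaff START PATTERN more chaff chaff more chaff " +
  "START PATTERN juicy stuff juicy stuff juicy stuff END PATTERN " +
  "chaff chaff START PATTERN more juicy stuff more juicy stuff END PATTERN";

// Tempered greedy token: lookahead is evaluated before every character.
const tempered = /START PATTERN(?:(?!(?:START|END) PATTERN)[^])*END PATTERN/g;

// Unrolled form: [^ES]* skips characters that cannot start a delimiter;
// the lookaheads are evaluated only when an S or an E is encountered.
const unrolled =
  /START PATTERN[^ES]*(?:S(?!TART PATTERN)[^ES]*|E(?!ND PATTERN)[^ES]*)*END PATTERN/g;

const temperedMatches = text.match(tempered) || [];
const unrolledMatches = text.match(unrolled) || [];

// Compare the two result lists element by element.
const same =
  temperedMatches.length === unrolledMatches.length &&
  temperedMatches.every((m, i) => m === unrolledMatches[i]);

console.log(same); // true
console.log(unrolledMatches.length); // 2
```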

Upvotes: 2
