jaySon

Reputation: 815

RegEx grab text between two specific strings

Say I had the line

"The quick brown fox jumps over the lazy dog"

and I wanted to grab everything between "brown" and "over", where the boundary words may also be substrings of other words. So I am trying to tell the RegEx something like

"grab everything in this line beginning at the string brown until you find the string over"

So I did

brown[^("over")]*

but the result is brown f, because "fox" contains an "o" which is contained in "over".
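The reason is that square brackets define a character class, so [^("over")] excludes each of the single characters ( " o v e r ) rather than the word over. A minimal sketch reproducing the result, assuming Python's re module (the question names no language, and the variable names are only illustrative):

```python
import re

line = "The quick brown fox jumps over the lazy dog"

# [^("over")] is a negated character class: it rejects the single
# characters ( " o v e r ) and has no notion of the word "over".
attempt = re.search(r'brown[^("over")]*', line)

# The match stops at the first excluded character, the "o" in "fox".
print(repr(attempt.group()))  # 'brown f'
```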

I just couldn't find a solution to this, so I hope you can help.

Upvotes: 2

Views: 311

Answers (1)

Wiktor Stribiżew

Reputation: 626691

Alright, matching really anything between two substrings (where the trailing substring must be the leftmost match, i.e. the one closest to the leading substring) is best achieved with the unrolling-the-loop method, which involves negated character classes (sometimes combined with a look-ahead).

Here is one for your case:

\bbrown\b[^o]*(?:o(?!ver\b)[^o]*)*\bover\b

See the regex demo

Note that this expression is basically equivalent to (?s)\bbrown\b.*?\bover\b, where .*? matches 0 or more of any character, but as few as possible to return a valid match. However, the unrolled version involves much less backtracking since it is linear.

The unrolled lazy matching is turned into [^o]*(?:o(?!ver\b)[^o]*)* here. The negated character class [^o] matches any character but o, newlines included. Thus, we do not have to worry about matching newlines.
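A quick way to check both the equivalence and the newline behaviour is to run the patterns side by side. This is only a sketch, assuming Python's re module; the helper found and the sample strings are made up for illustration:

```python
import re

unrolled = re.compile(r"\bbrown\b[^o]*(?:o(?!ver\b)[^o]*)*\bover\b")
lazy_dotall = re.compile(r"(?s)\bbrown\b.*?\bover\b")  # . also matches newlines
lazy_plain = re.compile(r"\bbrown\b.*?\bover\b")       # . stops at newlines

one_line = "The quick brown fox jumps over the lazy dog"
two_lines = "The quick brown fox\njumps over the lazy dog"

def found(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group() if m else None

print(found(unrolled, one_line))          # brown fox jumps over
print(found(lazy_dotall, one_line))       # brown fox jumps over
print(repr(found(unrolled, two_lines)))   # 'brown fox\njumps over'
print(repr(found(lazy_plain, two_lines))) # None
```

Without the (?s) flag the lazy version fails on the two-line input, while the unrolled version needs no flag at all.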

The \b word boundaries help match whole words only. If you need no whole word matching, just remove all \b from the pattern.
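To illustrate the difference, here is a sketch (again assuming Python's re module) that compares the whole-word pattern with the version without \b. The capturing group around the middle part is an addition for illustration, not part of the pattern above; it is one possible way to get only the text between the two boundary strings:

```python
import re

# Whole-word version, exactly as in the answer.
whole_words = re.compile(r"\bbrown\b[^o]*(?:o(?!ver\b)[^o]*)*\bover\b")

# All \b removed; the added group 1 captures only the text in between.
substrings = re.compile(r"brown([^o]*(?:o(?!ver)[^o]*)*)over")

sample = "brownish foxes hover"
line = "The quick brown fox jumps over the lazy dog"

print(whole_words.search(sample))               # None: no whole words here
print(substrings.search(sample).group())        # brownish foxes hover
print(repr(substrings.search(sample).group(1))) # 'ish foxes h'
print(repr(substrings.search(line).group(1)))   # ' fox jumps '
```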

Here is my regex breakdown:

  • \bbrown\b - matches brown as a whole word
  • [^o]* - 0 or more characters other than o
  • (?:o(?!ver\b)[^o]*)* - 0 or more sequences of an o that is not followed by ver and a word boundary ((?!ver\b)), followed by 0 or more characters other than o ([^o]*)
  • \bover\b - matches a whole word over.
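The same breakdown can be kept next to the pattern itself. The sketch below assumes Python, whose re.VERBOSE flag ignores whitespace in the pattern and allows # comments; the name BETWEEN is hypothetical:

```python
import re

BETWEEN = re.compile(
    r"""
    \bbrown\b        # "brown" as a whole word
    [^o]*            # 0 or more characters other than "o"
    (?:              # 0 or more repetitions of:
        o(?!ver\b)   #   an "o" that does not start the whole word "over"
        [^o]*        #   then 0 or more characters other than "o"
    )*
    \bover\b         # "over" as a whole word
    """,
    re.VERBOSE,
)

match = BETWEEN.search("The quick brown fox jumps over the lazy dog")
print(match.group())  # brown fox jumps over
```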

Upvotes: 2
