user1934428

Reputation: 22301

How to match regular expression even less greedy than not-greedy?

This question focuses on PCRE regular expressions as used by grep -P.

Imagine I have the string abcRabcSxyxz and am searching for a substring which starts with abc and ends with x, with the restriction that no shorter substring of this match would also match.

My first attempt was a non-greedy regexp,

grep -Po 'abc.*?x' <<<abcRabcSxyxz

but this returns abcRabcSx, while I would like to find just abcSx. It is obvious why even my non-greedy attempt still provides a match which is too long; I need the regexp engine to try harder. My second attempt was

grep -Po '(?>abc.*?)x' <<<abcRabcSxyxz

which did not provide a match at all (maybe I don't really understand the usage of (?>...) explained here).

Any easy solution for my problem anyone?

UPDATE I see from the comments that my example does not precisely explain what I am searching for, so here is a more general description:

I am searching for matches of the form PXQ, where P, X and Q are arbitrary patterns, and X should not contain a match of P. Plus, I don't want to literally retype the pattern P inside X.

For instance

`[(][^(]*[)]`

would be a possible (but not satisfying) solution for the concrete case where I am searching for a parenthesized expression which does not contain another parenthesized expression (here, P is [(], X is an arbitrary string, and Q is [)]). But even this example shows that I have to literally repeat the information contained in P when specifying the middle part ([^(]*), to make sure that P is not contained there. I am looking for a way which makes this explicit repetition unnecessary.

Upvotes: 1

Views: 186

Answers (1)

James Risner

Reputation: 6094

Interesting question. Much of this was worked out in the comments; thanks to Casimir et Hippolyte, Felix Kling, and user1934428.

The solution uses PCRE and is as follows:

grep -Po '(abc)(?:(?!(?1)).)*?x' <<< abcRabcSxyxz

We know the result will start with "abc" and end in "x". So let us walk through how this pattern works.

  • We start by grouping the expected prefix as (abc), which becomes group 1.
  • A ( followed by ?: opens a group that is neither captured nor counted.
  • Next up is a negative lookahead assertion, (?!.
  • The subject of the lookahead is (?1), a subroutine call that re-runs the pattern of group 1 (in this case abc).
  • The . matches any single character at a position where the lookahead did not find abc, in this case matching the S.
  • The group ends with )*?, a non-greedy quantifier that matches as few repetitions as possible, possibly zero.
  • The final entry is the x, which the question designated as the ending character.
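
To see the steps above in action, here is a small sketch that compares the plain non-greedy attempt from the question with the lookahead version, and then applies the same construction to the parenthesized case from the question's update. It assumes a grep built with PCRE support (GNU grep -P); the variable names and the input string f(a(b)c) g(d) are made up for illustration.

```shell
s='abcRabcSxyxz'

# Plain non-greedy quantifier: the leftmost "abc" wins, so the match is too long.
lazy=$(printf '%s\n' "$s" | grep -Po 'abc.*?x')

# Lookahead version: (?1) re-runs group 1 (abc) inside the negative lookahead,
# so the middle part can never run across another "abc".
tempered=$(printf '%s\n' "$s" | grep -Po '(abc)(?:(?!(?1)).)*?x')

printf 'lazy:     %s\n' "$lazy"
printf 'tempered: %s\n' "$tempered"

# Same construction with P = [(] and Q = [)]; P is written only once,
# as group 1, and (?1) refers back to it.
parens=$(printf '%s\n' 'f(a(b)c) g(d)' | grep -Po '([(])(?:(?!(?1)).)*?[)]')
printf '%s\n' "$parens"
```

The first command reproduces the too-long match abcRabcSx from the question, the second yields abcSx, and the last one prints only the innermost groups (b) and (d), one per line. Note that the lookahead is checked before every single character of the middle part, so this may be slower than a hand-written negated character class on long inputs.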

Upvotes: 1
