Bob Fishel

Reputation: 123

Regex Greediness

I have a Perl regex that I'm fairly certain should work, but it is being too greedy:

regex: (?:.*serial[^\d]+?(\d+).*)

Test string: APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou

Desired group 1 match: 123456

Actual group 1 match: 12

I've tried every permutation of lookahead and behind and laziness and I can't get the damn thing to work.

WHAT AM I MISSING.

Thanks!

Upvotes: 1

Views: 277

Answers (2)

ikegami

Reputation: 386206

The problem is not greediness; it's case-sensitivity.

Currently your regex matches the 12 at the end of serialnun12 because those are the only digits following serial. The ones you want follow SERIAL. S and s are different characters.

There are two solutions.

  1. Use the uppercase characters in the pattern.

    my ($serial) = $string =~ /SERIAL\D*(\d+)/;
    
  2. Use case-insensitive matching.

    my ($serial) = $string =~ /serial\D*(\d+)/i;
    

    There's probably no need for this, but I thought I'd mention it just in case.
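As a rough check of the two solutions above, here is a minimal sketch that runs the question's original pattern and both fixes against the test string from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Test string copied from the question.
my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# Original pattern: case-sensitive, so only the lowercase "serial" can match.
my ($original) = $string =~ /(?:.*serial[^\d]+?(\d+).*)/;

# Solution 1: spell the literal in uppercase.
my ($upper) = $string =~ /SERIAL\D*(\d+)/;

# Solution 2: case-insensitive matching; the unanchored search finds
# the leftmost occurrence, which is the uppercase SERIAL.
my ($insensitive) = $string =~ /serial\D*(\d+)/i;

print "original:    $original\n";
print "upper:       $upper\n";
print "insensitive: $insensitive\n";
```

Note that both fixes also drop the leading `.*` from the original pattern, so the search starts at the first occurrence of the word instead of backtracking from the end of the string.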

Upvotes: 3

zx81

Reputation: 41838

The Problem is Not Greediness, but Case-Sensitivity

Currently your regex matches the 12 at the end of serialnun12, probably because it is case-sensitive. We have two options: using upper-case, or making the pattern case-insensitive.

Option 1: Use Upper-Case

If you only want 123456, you can use:

SERIALNO\K\d+

The \K tells the engine to drop what was matched so far from the final match it returns.

If you want to match the whole string and capture 123456 to Group 1, use:

.*?SERIAL\D+(\d+).*

Option 2: Turn On Case-Insensitivity Using (?i) Inline or the i Flag

To only match 123456, you can use:

(?i)serial\D+\K\d+

Note that if you use the g flag, this would match both numbers (123456 and 12).

If you want to match the whole string and capture 123456 to Group 1, use:

(?i).*?serial\D+(\d+).*
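The Option 2 patterns can be tried out with a short sketch like the one below, assuming Perl 5.10 or later (needed for `\K`). The last line is an extra comparison not taken from the answer: it swaps the lazy `.*?` for a greedy `.*` to show why the lazy form matters once the match is case-insensitive.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# \K drops everything matched so far, so $& holds only the digits.
my $first;
$first = $& if $string =~ /(?i)serial\D+\K\d+/;

# With /g in list context (and no capture groups), every match is returned.
my @all = $string =~ /(?i)serial\D+\K\d+/g;

# Whole-string match with a lazy prefix: Group 1 is the first number.
my ($lazy) = $string =~ /(?i).*?serial\D+(\d+).*/;

# Same pattern with a greedy prefix: .* backtracks from the end of the
# string, so Group 1 comes from the last "serial" instead.
my ($greedy) = $string =~ /(?i).*serial\D+(\d+).*/;

print "first:  $first\n";
print "all:    @all\n";
print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```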

A few tips

  • You can turn on case-insensitivity with either the (?i) inline modifier or the i flag at the end of the pattern: /serial\D+\K\d+/i
  • Instead of [^\d], use \D
  • There is no need for a lazy quantifier in something like \D+\d+ because the two tokens are mutually exclusive: there is no danger that the \D will run over the \d
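The tips above can be checked with a small sketch. It compares the question's `[^\d]+?` against `\D+`, and the inline `(?i)` modifier against the trailing `i` flag, on the test string from the question. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $string = 'APPLICATIONSERIALNO123456Plnsn123456te20140728tdrnserialnun12hou';

# Lazy negated class versus greedy \D: \D can never consume a digit,
# so both stop at the same place and capture the same number.
my ($lazy_class) = $string =~ /SERIAL[^\d]+?(\d+)/;
my ($greedy_D)   = $string =~ /SERIAL\D+(\d+)/;

# Inline modifier versus trailing flag: equivalent here.
my ($inline) = $string =~ /(?i)serial\D+(\d+)/;
my ($flag)   = $string =~ /serial\D+(\d+)/i;

print "lazy [^\\d]+?: $lazy_class\n";
print "greedy \\D+:   $greedy_D\n";
print "inline (?i):  $inline\n";
print "i flag:       $flag\n";
```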

Upvotes: 4
