Chris Kraus

Reputation: 71

Perl regex \d+ and [0-9]+ capture only a single digit in an alphanumeric string

I encountered the following problem: with the code in the first example, the variable $1 contains only the last digit of the number in each string. However, with the third example, where the pattern consists only of (\d+), $1 contains the full number with all digits. To me it appears that \d+ works differently in an alphanumeric context than in a purely numeric one.

Here are my questions: Can you reproduce this? Is this behavior intended? How can I capture the full number in the alphanumeric context using a regex in Perl? If \d+ is lazy by nature, can I make it greedy, and if so, how?

Example 1:

perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\w+(\d+)\w+/) {$num = $1; print $num,"\n";}'

Output:

9
0

Example 2:

perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\w+([0-9]+)\w+/) {$num = $1; print $num,"\n";}'

Output:

9
0

Example 3:

perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/(\d+)/) {$num = $1; print $num,"\n";}'

Output:

199
200

Thanks in advance. Any help is highly appreciated.

Best, Chris

Upvotes: 3

Views: 1907

Answers (2)

David Verdin

Reputation: 490

The problem is that \w also matches digits.

You should replace "\w" with "\D" ("not a digit"). For example:

perl -e 'for ($i = 199; $i < 201; $i ++) { print "words".$i."words\n"}' | perl -ne 'if (/\A\D+(\d+)\D+/) {$num = $1; print $num,"\n";}'

Output:

199
200

Of course, if your data can contain more than one occurrence of digits in a single string, you'll need some more precise regexp.

Upvotes: 1

Wiktor Stribiżew

Reputation: 627110

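The results you get are expected. In /\A\w+(\d+)\w+/, the first \w+ is a greedy pattern and will grab as many chars as it can match. Since \w also matches digits, it consumes the leading digits as well and backtracks only as far as needed: it gives back a single digit for (\d+) and at least one char for the final \w+.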

Either use a lazy quantifier, /\A\w+?(\d+)\w+/, or subtract the digits from \w, as in /\A[^\W\d]+(\d+)\w+/. The \w+? will match 1 or more word chars (letters, digits, _) but as few as possible, while [^\W\d] matches only letters and _, so there is no need for a lazy quantifier with that pattern.

Upvotes: 4
