Juto

Reputation: 1276

Regex greediness REasking

I have this text $line = "config.txt.1", and I want to match it with a regex and extract the number part of it. I am using three versions:

$line = "config.txt.1";

(my $result) = $line =~ /(\d*).*/;    #ver 1, matched, but returns nothing

(my $result) = $line =~ /(\d).*/;     #ver 2, matched, returns 1

(my $result) = $line =~ /(\d+).*/;    #ver 3, matched, returns 1

I think the * is messing things up. I have been looking at this, but I still don't understand the greedy mechanism in the regex engine. If I start from the left of the regex, there might be no digits in the text, so ver 1 will still match, but ver 3 won't. Can someone explain why that is, and how I should write the pattern for what I want (potentially with a number, not necessarily a single digit)?

Edit

Requirement: there may or may not be a number, it is not necessarily a single digit, and the match may capture nothing but should not fail

The output must be as follows (for the above example):

config.txt 1

Upvotes: 0

Views: 93

Answers (5)

fugu

Reputation: 6578

Use the literal '.' as a reference to match before the number:

#!/usr/bin/perl
use strict;
use warnings;

my @line = qw(config.txt file.txt config.txt.1 config.foo.2 config.txt.23 differentname.fsdfsdsdfasd.2444);

my (@capture1, @capture2);
foreach (@line) {
    # "name.ext" part, e.g. "config.txt"
    my ($filematch) = /(\w+\.\w+)/;
    next unless defined $filematch;

    # Digits after an optional second '.'; empty string when there are none
    my ($numbermatch) = /\w+\.\w+\.?(\d*)/;

    push @capture1, $filematch;
    push @capture2, $numbermatch;
}

print "$capture1[$_]\t$capture2[$_]\n" for 0 .. $#capture1;

Output:

config.txt  
file.txt    
config.txt  1
config.foo  2
config.txt  23
differentname.fsdfsdsdfasd  2444

Upvotes: 2

TLP

Reputation: 67890

You do not need .* at all. These two statements assign the exact same number:

my ($match1) = $str =~ /(\d+).*/;
my ($match2) = $str =~ /(\d+)/;

By default a regex can match anywhere inside the string, so you do not need to pad it with wildcards.

Your first match does not capture a number because * can also match zero times. The engine accepts the first position where the whole pattern fits, and at the start of the string \d* is satisfied by zero digits, so it never goes looking for your number. That is why * is actually detrimental in that regex. Unless something is truly optional, you should use + instead.
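
As a minimal sketch of the point above (the variable names are made up for illustration), the following compares the three forms on the string from the question:

```perl
use strict;
use warnings;

my $str = "config.txt.1";

# With + at least one digit is required, so the engine scans forward to "1".
# The trailing .* changes what the whole match covers, not what is captured.
my ($with_tail)    = $str =~ /(\d+).*/;
my ($without_tail) = $str =~ /(\d+)/;

# With * zero digits are acceptable, so the match succeeds at the very
# start of the string and the capture group holds the empty string.
my ($star) = $str =~ /(\d*)/;

print "with .*   : '$with_tail'\n";     # '1'
print "without .*: '$without_tail'\n";  # '1'
print "\\d*       : '$star'\n";         # ''
```

Both + variants capture 1, while the * variant captures an empty string even though the match itself succeeds.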

Upvotes: 1

Dave Sherohman

Reputation: 46197

To capture all digits following a final . and not fail the match if the string doesn't end with digits, use /(?:\.(\d+))?$/

perl -E 'if ("abc.123" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched 123
perl -E 'if ("abc" =~ /(?:\.(\d+))?$/) { say "matched $1" } else { say "match failed" }'
matched

Upvotes: 1

Juto

Reputation: 1276

Thanks guys, I think I figured out for myself what I want:

my ($match) = $line =~ /\.(\d+)$/;    #this captures the digits after the
                                      #final '.'; $match is undef, with no
                                      #error, if there is no trailing number
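
A quick way to check a pattern like this against the requirement is to run it over names with and without a trailing number. The sketch below is an illustration rather than part of the original answer, and the helper names unanchored and anchored are made up. It compares /\.(\d+)?/ with the end-anchored /\.(\d+)$/:

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
# Each returns the captured number, or undef when nothing was captured.
sub unanchored { my ($s) = @_; my ($n) = $s =~ /\.(\d+)?/; return $n }
sub anchored   { my ($s) = @_; my ($n) = $s =~ /\.(\d+)$/; return $n }

for my $name (qw(config.txt.1 config.txt.23 config.txt)) {
    my $plain = unanchored($name) // 'undef';
    my $anch  = anchored($name)   // 'undef';
    print "$name\tunanchored=$plain\tanchored=$anch\n";
}
```

For config.txt.1 the unanchored pattern matches at the first '.', which is followed by txt, so the optional group captures nothing. Anchoring at the end of the string is what makes the capture land on the trailing number, and a name without one simply yields undef.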

Upvotes: 1

amon

Reputation: 57640

The regex /(\d*).*/ always matches immediately, because it can match zero characters. It translates to: match as many digits at this position as possible (zero or more), then match as many non-newline characters as possible. The match starts at the c of config, where there are zero digits. That is acceptable for \d*, so the capture is empty and .* consumes the rest of the line.

You probably want to use a regex like /\.(\d+)$/ -- this matches an integer number between a period . and the end of string.
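
To see where each pattern actually matches, Perl's built-in @- and @+ arrays give the start and end offsets of the last successful match and its groups. This is a small sketch with made-up variable names, using the string from the question:

```perl
use strict;
use warnings;

my $line = "config.txt.1";
my (@star, @anch);

# /(\d*).*/ : the first attempt, at offset 0, already succeeds.
if ($line =~ /(\d*).*/) {
    # $-[0] is where the whole match starts; $-[1] and $+[1] delimit group 1.
    @star = ($-[0], $-[1], $+[1], $1);
}

# /\.(\d+)$/ : the engine moves forward until '.', digits and end of string all fit.
if ($line =~ /\.(\d+)$/) {
    @anch = ($-[0], $-[1], $+[1], $1);
}

printf "star:     match starts at %d, group 1 spans %d..%d ('%s')\n", @star;
printf "anchored: match starts at %d, group 1 spans %d..%d ('%s')\n", @anch;
```

The first pattern reports a match starting at offset 0 with an empty group spanning 0..0. The second starts at offset 10, the final '.', and captures 1 at 11..12.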

Upvotes: 2
