MARS

Reputation: 75

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!

I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.

Here is my regex:

if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
    print $line;
    print $1;
}

And here is the error:

x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.

As I understand it, $1 is supposed to return x. Where is my code going wrong?

Upvotes: 3

Views: 2866

Answers (2)

TLP

Reputation: 67930

The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:

if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
    print $line;
    print $1;
}

You are surrounding your match with .*\s+. This is unlikely to be doing what you think. You never need .* with m// unless you are capturing the text it matches (or reading the whole match from $&). A match is not anchored by default and can succeed anywhere in the string. To anchor the match you must use ^ or $. E.g.:

if ('abcdef' =~ /c/)      # returns true
if ('abcdef' =~ /^c/)     # returns false, match anchored to beginning
if ('abcdef' =~ /c$/)     # returns false, match anchored to end
if ('abcdef' =~ /c.*$/)   # returns true

As the last example shows, .* followed by an anchor is redundant: to get the match you need only remove the anchor and use /c/. Or, if you want to capture everything from the match to the end of the string:

if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'

You can also use $&, which contains the entire match, regardless of parentheses.
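The anchoring examples above are fragments, so here is a minimal runnable sketch of the same checks. The variable names are only illustrative and are not part of the question's script.

```perl
use strict;
use warnings;

my $str = 'abcdef';

# Unanchored: the pattern may match anywhere in the string.
my $unanchored = $str =~ /c/  ? 1 : 0;
# Anchored to the start: fails because the string starts with 'a'.
my $at_start   = $str =~ /^c/ ? 1 : 0;
# Anchored to the end: fails because the string ends with 'f'.
my $at_end     = $str =~ /c$/ ? 1 : 0;

# Capture from 'c' to the end of the string.
my $captured;
$captured = $1 if $str =~ /(c.*)$/;

# $& holds the whole match even when the pattern has no parentheses.
my $whole;
$whole = $& if $str =~ /c.*/;

print "unanchored=$unanchored at_start=$at_start at_end=$at_end\n";
print "captured=$captured whole=$whole\n";
```

Here both $captured and $whole end up holding the same text, because the parentheses wrap the entire pattern.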

You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence for a word boundary, \b. It is a zero-length assertion that is true wherever a word character sits next to a non-word character or the edge of the string.

'abc cde fgh' =~ /\bde\b/     # no match
'abc cde fgh' =~ /\bcde\b/    # match
'abc cde fgh' =~ /\babc/      # match
'abc cde fgh' =~ /\s+abc/     # no match! there is no whitespace before 'a'

As you see in the last example, \s+ fails at the start or end of the string. Do note that \b also finds a boundary next to non-word characters that may appear inside what you consider a single word, such as a hyphen:

'aaa-xxx' =~ /\bxxx/          # match

You must decide whether you want this behaviour. If you do not, an alternative to \s is the double-negated form (?!\S). This is a zero-length negative look-ahead for non-whitespace, so it succeeds when the next character is whitespace and also at the end of the string. Use the corresponding look-behind, (?<!\S), to check the other side.
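To make the difference concrete, here is a small sketch comparing \b with the whitespace look-arounds. The test strings other than 'aaa-xxx' are made up for illustration.

```perl
use strict;
use warnings;

# \b sees a boundary between '-' and 'x', so it matches inside 'aaa-xxx'.
my $boundary_hyphen = 'aaa-xxx' =~ /\bxxx\b/ ? 1 : 0;

# The look-arounds require whitespace or a string edge on both sides.
my $look_hyphen = 'aaa-xxx' =~ /(?<!\S)xxx(?!\S)/ ? 1 : 0;  # '-' precedes: no match
my $look_space  = 'aaa xxx' =~ /(?<!\S)xxx(?!\S)/ ? 1 : 0;  # space before, end after
my $look_start  = 'xxx aaa' =~ /(?<!\S)xxx(?!\S)/ ? 1 : 0;  # start before, space after

print "boundary_hyphen=$boundary_hyphen look_hyphen=$look_hyphen ",
      "look_space=$look_space look_start=$look_start\n";
```

Unlike \s+, the look-arounds consume no characters, so they also succeed at the very start and end of the string.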

Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).

So your regex becomes:

/\b(\w+)\b/

Or

/(?<!\S)(\w+)(?!\S)/
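As a rough sketch of how either pattern might be used, the following collects every capture from a line by adding /g and matching in list context. The second sample line is a made-up Python-like statement, chosen only to show where the two patterns disagree.

```perl
use strict;
use warnings;

my $line = "x = 1\n";   # assumed sample input, newline included as read from a file

# In list context with /g, a pattern with one capture group returns every capture.
my @by_boundary = $line =~ /\b(\w+)\b/g;
my @by_space    = $line =~ /(?<!\S)(\w+)(?!\S)/g;

# A line whose tokens contain punctuation shows where the two differ.
my $dotted = "self.count = a-b\n";
my @dotted_boundary = $dotted =~ /\b(\w+)\b/g;
my @dotted_space    = $dotted =~ /(?<!\S)(\w+)(?!\S)/g;

print "boundary: @by_boundary\n";
print "space:    @by_space\n";
print "dotted boundary: @dotted_boundary\n";
print "dotted space:    @dotted_space\n";
```

On the first line both patterns return x and 1. On the second, \b splits the tokens at the dot and the hyphen, while the look-around version returns nothing because no run of word characters is delimited by whitespace on both sides.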


Upvotes: 4

Leeft

Reputation: 3837

You're not capturing the result:

if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) { print $1 }
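Here is a small sketch of what the added parentheses capture, assuming the line still carries its trailing newline as in the question's output. Note that for this input the group holds 1 rather than x: the pattern demands whitespace before the captured word, and nothing precedes x.

```perl
use strict;
use warnings;

my $line = "x = 1\n";   # assumed input, newline included

my $captured;
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
    $captured = $1;     # defined now that the pattern has a capture group
}

print defined($captured) ? "captured: $captured\n" : "no match\n";
```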

If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:

if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
    my $var = $1;
    my $val = $2;
}
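A sketch of that approach applied to several lines, assuming only simple name = value assignments are of interest. The sample lines and the %assigned hash are hypothetical stand-ins for the Python file and whatever you do with the results.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the Python file being scanned.
my @lines = ("x = 1\n", "  total = x\n", "print(x)\n");

my %assigned;
for my $line (@lines) {
    # /x lets the pattern be spaced out for readability.
    if ($line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/x) {
        my ($var, $val) = ($1, $2);
        $assigned{$var} = $val;
    }
}

# Print the recorded assignments in a stable order.
print "$_ = $assigned{$_}\n" for sort keys %assigned;
```

The third line is skipped because it contains no assignment. Anything more involved than a single word on the right-hand side would also be skipped, which is why the approach is described as crude.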

Upvotes: 7
