Mer

Reputation: 23

Perl $1 giving uninitialized value error

I am trying to extract a part of a string and put it into a new variable. The string I am looking at is:

maker-scaffold_26653|ref0016423-snap-gene-0.1

(inside a $gene_name variable)

and the thing I want to match is:

scaffold_26653|ref0016423

I'm using the following piece of code:

my $gene_name;
my $scaffold_name;
if ($gene_name =~ m/scaffold_[0-9]+\|ref[0-9]+/) { 
    $scaffold_name = $1;
    print "$scaffold_name\n";
}

I'm getting the following error when trying to execute:

Use of uninitialized value $scaffold_name in concatenation (.) or string

I know that the pattern is right, because if I use $' instead of $1 I get

-snap-gene-0.1

I'm at a bit of a loss: why will $1 not work here?

Upvotes: 1

Views: 188

Answers (2)

Slade

Reputation: 1364

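To expand on Jens' answer, () in a regex creates an anonymous (numbered) capture group. The text matched by each group is stored in $1, $2, $3 and so on, numbered by the position of the opening parenthesis from left to right. Your pattern has no parentheses, so $1 is never set, which is why you get the uninitialized value warning. For example,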

/(..):(..):(..)/

on an HH:MM:SS time string will store hours, minutes, and seconds in $1, $2, $3 respectively. Naturally this begins to become unwieldy and is not self-documenting, so you can assign the results to a list instead:

my ($hours, $mins, $secs) = $time =~ m/(..):(..):(..)/;

So your example could bypass the use of $ variables by doing direct assignment:

my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;
# $scaffold_name now contains 'scaffold_26653|ref0016423'
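
One thing to watch with direct assignment: a failed match returns an empty list, so the variable is left undefined. Here is a minimal sketch of checking for that; the second input string and the @found array are made up for illustration:

```perl
use strict;
use warnings;

# The first string is from the question; the second is a hypothetical non-matching input.
my @gene_names = (
    'maker-scaffold_26653|ref0016423-snap-gene-0.1',
    'no-match-here',
);

my @found;
for my $gene_name (@gene_names) {
    # In list context the match returns its captures, or an empty list on failure.
    my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;

    if (defined $scaffold_name) {
        push @found, $scaffold_name;
        print "$scaffold_name\n";
    }
    else {
        print "no scaffold name in '$gene_name'\n";
    }
}
```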

You can even get rid of the ugly =~ binding by using for as a topicalizer:

my $scaffold_name;
for ($gene_name) {
    ($scaffold_name) = m/(scaffold_\d+[|]ref\d+)/;
    print $scaffold_name;
}

If things start to get more complex, I prefer to use named capture groups (introduced in Perl v5.10.0):

$gene_name =~ m{
    (?<scaffold_name> # ?<name> creates a named capture group
        scaffold_\d+   # 'scaffold_' and its trailing digits
        [|]            # Literal pipe symbol
        ref\d+         # 'ref' and its trailing digits
    )
}xms; # The x flag lets us write more readable regexes
print $+{scaffold_name}, "\n";

The results of named capture groups are stored in the magic hash %+. Access works just like any other hash lookup, with the capture group names as the keys. %+ is scoped to the enclosing block in the same way the numbered variables ($1, $2, ...) are, so it can be used as a drop-in replacement for them in most situations.
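
As a rough sketch of how %+ behaves with several named groups, here is the HH:MM:SS example again. The sample value of $time and the %parts hash are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical sample value; the example above only names the variable $time.
my $time = '12:34:56';

my %parts;
if ($time =~ m/(?<hours>\d\d):(?<mins>\d\d):(?<secs>\d\d)/) {
    # Copy %+ into an ordinary hash, since a later successful match would replace its contents.
    %parts = %+;
}

print "$parts{hours}h $parts{mins}m $parts{secs}s\n";
```

Copying into a lexical hash is optional, but it keeps the captures around after other regexes have run.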

It's overkill for this particular example, but as regexes start to get larger and more complicated, this saves you the trouble of either having to scroll all the way back up and count anonymous capture groups from left to right to find which of those darn $ variables is holding the capture you wanted, or scan across a long list assignment to find where to add a new variable to hold a capture that got inserted in the middle.

My personal rule of thumb is to assign the results of anonymous captures to descriptively named, lexically scoped variables for three or fewer captures, then switch to using named captures, comments, and indentation in regexes when more are necessary.

Upvotes: 3

Jens

Reputation: 69440

If you want to use a value from the match, you have to put parentheses () around the part of the regex you want to capture.
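
A minimal sketch of that fix applied to the code from the question, assuming $gene_name holds the string shown there:

```perl
use strict;
use warnings;

# Assumed input: the gene name given in the question.
my $gene_name = 'maker-scaffold_26653|ref0016423-snap-gene-0.1';
my $scaffold_name;

# The parentheses create capture group 1, so $1 is set when the match succeeds.
if ($gene_name =~ m/(scaffold_[0-9]+\|ref[0-9]+)/) {
    $scaffold_name = $1;
    print "$scaffold_name\n";
}
```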

Upvotes: 4
