Faheem Mitha

Reputation: 6326

debugging a perl assignment

I should explain as background to this question that I don't know any Perl, and have a violent allergy to regular expressions (we all have our weaknesses). I'm trying to figure out why a Perl program won't accept the data I'm feeding it. I don't need to understand this program in any depth - I'm just doing a timing comparison.

Consider this assignment statement:

($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

If I understand this correctly, it is checking if sample_ls_id matches some regex, and if so, assigning the entire string, or something like that.

However, I don't understand how this works. According to the documentation, namely perldoc perlretut, which I looked at briefly

$sample_ls_id =~ /:\w\w(\d+):/

just returns true or false if there is a match.

The strings I'm trying to match look like

1000    10      0       0       1        urn:lsid:dcc.hapmap.org:Individual:CEPH1000.10:1        urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1

This fails with the error

Use of uninitialized value $sample_ls_id in concatenation (.) or string
at database/populate/family.pl line 38, <INPUT> line 1.

Line 38 is

print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";

See the complete script below. However, the apparently very similar string

1420    9       0       0       1       urn:lsid:dcc.hapmap.org:Individual:CEPH1420.09:1  urn:lsid:dcc.hapmap.org:Sample:NA12003:1

seems to pass.

For context, the entire piece of code is:

use strict;
use warnings;
use Getopt::Long;

my $input_file = "data/family_ceu.txt";
my $output_file = "sql/family_ceu.sql";
my $population_code = "CEU";

GetOptions ('i=s' => \$input_file,
            'o=s' => \$output_file,
            'p=s' => \$population_code
            );

usagecheck();

my $created_by = 'gwas_analyzer';

print "Creating SQL file for inserting family data from $input_file\n";

open (INPUT, "< $input_file");
open (OUTPUT, "> $output_file");

print OUTPUT "INSERT INTO population (population_code, private) VALUES ('$population_code', 'f');\n";
print OUTPUT "COPY family (ls_id, family_ped_id, individual_ped_id, father_ped_id, mother_ped_id, sex, created_by, population_code) FROM stdin;                      
";

while (my $line = <INPUT>)
{
    chomp $line;

    #Skip any comment lines 
    next if($line =~ /^#/);

    my ($family_ped_id, $individual_ped_id, $father_ped_id, $mother_ped_id, $sex, $individual_ls_id, $sample_ls_id) = split (/\t/, $line);

    ($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

    print OUTPUT "$sample_ls_id\t$family_ped_id\t$individual_ped_id\t$father_ped_id\t$mother_ped_id\t$sex\t$created_by\t$population_code\n";
}

print OUTPUT "\\.\n";
close OUTPUT;

sub usagecheck
{
    if (!$input_file || !$output_file || !$population_code)
    {
        print "Missing argument (see required arguments below):\n";
        usage();
        exit;
    }
}

sub usage
{
    print "perl family.pl -i <input file> -o <output file> -p <population code>\n";
}

I'm sure this is a very simple question if you know regexes and Perl.

Upvotes: 2

Views: 166

Answers (3)

Shalini

Reputation: 455

When $sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1';

The regular expression '/:\w\w(\d+):/' fails. It matches only when the string contains a colon ':', followed by exactly two "word" characters '\w\w', followed by one or more digits '\d+', followed by another colon ':'. In ':SAMPLE1:' the two word characters 'SA' are followed by the letter 'M' rather than a digit, so there is no match.

When $sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';

The regular expression '/:\w\w(\d+):/;' finds its match in ':NA12003:'. ( colon, 2 word characters, digits and a colon ).

my $sample_ls_id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';
($sample_ls_id) = $sample_ls_id =~ /:\w\w(\d+):/;

The list assignment to '($sample_ls_id)' receives the '(\d+)' portion of the match (also stored in $1), which in this case is 12003.

You were getting the warning with the earlier example because the regular expression fails there, so the list assignment leaves $sample_ls_id undefined, and the print on line 38 then interpolates an undefined value.
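As a minimal sketch of the behaviour described above, the snippet below runs the same pattern against the two sample identifiers from the question. The helper name extract_sample_number is hypothetical and only serves the illustration; the original script does the assignment inline.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# digits captured by (\d+), or undef when the pattern does not match.
sub extract_sample_number {
    my ($id) = @_;
    # List assignment: $num receives the capture, or undef on failure.
    my ($num) = $id =~ /:\w\w(\d+):/;
    return $num;
}

# The two sample identifiers quoted in the question.
my @ids = (
    'urn:lsid:dcc.hapmap.org:Sample:SAMPLE1:1',
    'urn:lsid:dcc.hapmap.org:Sample:NA12003:1',
);

for my $id (@ids) {
    my $num = extract_sample_number($id);
    print "$id => ", (defined $num ? $num : 'no match (undef)'), "\n";
}
```

The first identifier should report no match and the second should report 12003, which is why only the SAMPLE1 line triggers the "uninitialized value" warning.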

Upvotes: 4

onaclov2000

Reputation: 5841

Rather than storing the result of the match back into the variable in one statement, just use the capture. The text matched by (\d+) is held in $1, so simply change your code to something like this:

if ($sample_ls_id =~ /:\w\w(\d+):/) {   # colon, two word characters, digits, colon
    $sample_ls_id = $1;                 # only trust $1 after a successful match; a failed match leaves its old value in place
}

I don't know why you're getting the error you're getting, but it seems like your code would make more sense like the above.

It could have something to do with your input not having that last element (i.e. you have A:B:C but you need A:B:C:D to store D in the sample ls id; if D is missing, it's never initialized and the regex wouldn't make sense).

Also, we don't have all the code (line 38 looks like it corresponds to the first line in your while loop); if you post more, that might help.
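On the question of what $1 holds when a match fails: as far as the Perl documentation describes it, a failed match does not reset the capture variables, so $1 keeps whatever the last successful match in the enclosing scope put there. A small sketch with made-up strings (not taken from the question's data):

```perl
use strict;
use warnings;

# Illustrative strings only; they are not from the question's input file.
'ab123' =~ /(\d+)/;            # succeeds, so $1 is set to "123"
my $after_success = $1;

'no digits here' =~ /(\d+)/;   # fails, so $1 is left untouched
my $after_failure = $1;

print "after success: $after_success\n";   # prints "after success: 123"
print "after failure: $after_failure\n";   # prints "after failure: 123"
```

This is why it is safer to test the result of the match before reading $1, rather than assuming it will be empty.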

Upvotes: 0

Neil

Reputation: 55402

In a list context, such as an assignment to ($sample_ls_id), =~ returns a list of the captures. It saves you extracting $1 etc. in a separate statement. If the match fails, it returns an empty list, which leaves $sample_ls_id undefined.
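A short sketch of the difference between the two contexts, using one of the identifiers from the question. The variable names $matched and $digits are illustrative, not from the original script.

```perl
use strict;
use warnings;

my $id = 'urn:lsid:dcc.hapmap.org:Sample:NA12003:1';

# Scalar context: the match yields a true/false value.
my $matched = $id =~ /:\w\w(\d+):/;

# List context (note the parentheses on the left-hand side):
# the match yields the list of captures.
my ($digits) = $id =~ /:\w\w(\d+):/;

print "scalar context: $matched\n";   # prints "scalar context: 1"
print "list context:   $digits\n";    # prints "list context:   12003"
```

The parentheses around the variable on the left are what select list context, so dropping them changes the assigned value from the captured digits to a success flag.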

Upvotes: 2
