Regular expression statement inside a while loop only matching and printing one of several expected matches

I've been struggling with this for a while and I was wondering if there was something obvious I've missed.

As a programming exercise, I'm trying to put together a simple script for calculating the components of a restriction enzyme digest mix. However, first I need to get a list of enzyme stock concentrations.

I pulled all the individual pages from the New England Biolabs enzyme page, and my goal with this current script is to pull out the name of the enzyme and the concentrations available from the company.

This example works with a local copy of EcoRI (link included at bottom of submission).

use warnings;
use strict;
open(FILE,'productR0101.asp');
my $line;
my $counter;
my $array1;
my $array2;
my $array3;
my $concentration;
my @array4;
$counter = 1;

while ($line = <FILE>) {
    chomp($line);

    if ($counter == 6 ){
        $array1 = $line;
        $counter++;
    }
    else{
        $counter++;
    }

    if ($line =~ m/.{8}units.ml/g) {
        (@array4) =$line =~ m/.{8}units.ml/g;
        print @array4;
    }
}
print "\n".$array1;
exit;

Every file has the enzyme name on the sixth line of the file, so I just pulled that whole line. However, the concentrations are in different locations, so my approach was to read in the file one line at a time, and match to the units/ml tag.

My thinking was that it should print out the match for each line, if there was one, every time the while loop runs, effectively resulting in a string of separate print statements.

This is where I get messed up. There are six different locations in this file with a units/ml tag: three for 20,000 and three for 100,000.

I was expecting six different results printed, but when I run this, only one 100,000 units/ml result is returned.

I've tried all sorts of fixes: concatenating strings, storing the result as a string, and concatenating it onto another array that never gets touched by the (@array4) = $line =~ m/.{8}units.ml/g line. Each attempt either breaks the script or gives the same result.

And finally, I apologize for any weird conventions. I'm still learning Perl, and my first experience programming was with MATLAB.

Also, the $array1, $array2, etc. exist because I was trying to keep track of exactly what was getting put where; my intention is to clean it up once I get it functional.

So does anyone have any ideas about what I'm doing wrong?

EDIT: the data source is the source code to each individual enzyme page. For this example, if you view the page source you get the complete input file I gave to the script.

Upvotes: 0

Answers (3)

Chris

Reputation: 1697

I can't exactly reproduce the behavior you've reported of only getting one of the 100,000 units/ml results, as I'm not sure what your input data is. However, I think the problem is that the regular expression has no captures. You should put parentheses around the part of the regex match that you want returned to @array4. So instead of this:

@array4 = $line =~ m/.{8}units.ml/g;

Try this:

~~@array4 = $line =~ m/(.{8})units.ml/g;~~

@array4 = $line =~ /(.{8})units.ml/;

EDIT: You also don't want to use the m/ and /g modifiers.
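
For comparison, here is a minimal sketch of how a list-context match behaves with and without captures and with and without /g. The sample line is made up, since the real page source is not shown, so treat it as an illustration only. Whether /g is wanted depends on whether a single line can contain more than one concentration.

```perl
use strict;
use warnings;

# Made-up sample line; the real page source may look different.
my $line = 'EcoRI 20,000 units/ml and 100,000 units/ml';

# No captures, /g, list context: one element per whole match.
my @whole = $line =~ /.{8}units.ml/g;

# Captures, /g, list context: one element per captured group.
my @captured = $line =~ /(.{8})units.ml/g;

# Captures without /g: only the first match's capture is returned.
my @first = $line =~ /(.{8})units.ml/;

printf "%d whole, %d captured, %d first\n",
    scalar @whole, scalar @captured, scalar @first;
```

On this sample line the /g forms find both concentrations, while the form without /g stops after the first.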

Upvotes: 0

Tim Pietzcker

Reputation: 336138

Are the 20,000 units/ml at the start of the line? Because in that case, .{8} would fail to match: the dot doesn't match newlines, and "20,000 " (including the trailing space) is only 7 characters.
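
A small sketch of that failure case, assuming a hypothetical line that begins with the concentration (the actual file contents are not shown in the question). The second pattern is one possible alternative, not something taken from the question: it matches the digits and commas directly instead of relying on a fixed width.

```perl
use strict;
use warnings;

# Hypothetical line where the concentration starts the line.
my $line = '20,000 units/ml';

# ".{8}" needs eight characters before "units", but only seven exist.
my @fixed_width = $line =~ /.{8}units.ml/g;

# Matching the digits and commas directly does not depend on position.
my @by_number = $line =~ m{([\d,]+)\s*units/ml}g;

printf "fixed width: %d match(es); by number: %s\n",
    scalar @fixed_width, "@by_number";
```

The fixed-width pattern finds nothing here, while the number-based pattern returns 20,000.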

Upvotes: 1

Borodin

Reputation: 126722

We really need to see the data you are processing, but it looks like you are storing only the last occurrence of /units.ml/ in @array4 because you are reading the file line by line.

I will add to this answer if you supplement your question, but for now I need to know:

What your data looks like
What the mysterious /.{8}/ is for
Whether you are aware that $array1, $array2, and $array3 are scalars, as well as being very bad names for variables

For now, here is a rewrite of your code using idiomatic Perl and the $. variable, which holds the line number of the line most recently read from the file:

use strict;
use warnings;

open my $file, '<', 'productR0101.asp' or die $!;

my $array1;
my @array4;

while (my $line = <$file>) {

  chomp $line;

  $array1 = $line if $. == 6;

  if ($line =~ m/.{8}units.ml/) {
    @array4 = $line =~ m/.{8}units.ml/g;
    print "@array4\n";
  }
}

print "\n".$array1;
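
One difference between this rewrite and the original is worth spelling out: the if test here has no /g. In Perl, a scalar-context m//g match leaves pos() just after the match, and a following list-context m//g on the same string resumes from there, so the first match on each line is skipped. The sketch below illustrates that behavior and also collects results across lines with push; the sample lines and variable names are made up, and this may or may not be the whole explanation for the output in the question.

```perl
use strict;
use warnings;

# Made-up line with two concentrations; real input may differ.
my $line = '20,000 units/ml or 100,000 units/ml';

my @after_scalar_g;
if ($line =~ /[\d,]+ units.ml/g) {    # scalar //g: pos() now sits after match 1
    @after_scalar_g = $line =~ /[\d,]+ units.ml/g;    # resumes from pos()
}

my @after_plain_test;
if ($line =~ /[\d,]+ units.ml/) {     # no /g: pos() is left alone
    @after_plain_test = $line =~ /[\d,]+ units.ml/g;  # starts from the beginning
}

# Accumulate matches across lines instead of overwriting the array each time.
my @lines = ('20,000 units/ml', 'no match here', '100,000 units/ml');
my @all;
for my $l (@lines) {
    push @all, $l =~ /([\d,]+) units.ml/g;
}

printf "after scalar //g: %d, after plain test: %d, accumulated: %d\n",
    scalar @after_scalar_g, scalar @after_plain_test, scalar @all;
```

With the /g test only the second concentration on the line survives, while the plain test keeps both.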

Upvotes: 0
