jeremyforan

Reputation: 1437

perl regex multiple groups

I am trying to do a screen scrape in Perl and have it down to an array of table elements.

The string:

<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
        <td>Finish</td>
        <td>200</td>
        <td>44</td>

Code:

if ($item =~ /<td>(.*)?<\/td>/)
{
    print "\t$item\n";
    print "\t1: $1\n";
    print "\t2: $2\n";
    print "\t3: $3\n";
    print "\t4: $4\n";
    print "\t5: $5\n";
    print "\t6: $6\n";
}

output:

1: 10:11:00
2: 
3: 
4: 
5: 
6: 

I tried multiple things but could not get the intended results. Any thoughts?

Upvotes: 1

Views: 210

Answers (2)

perreal

Reputation: 98068

use strict;
use warnings;

my $item = <<EOF;
<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
        <td>Finish</td>
        <td>200</td>
        <td>44</td>
EOF

if(my @v = ($item =~ /<td>(.*)<\/td>/g))
{
  print "\t$item\n";
  print "\t1: $v[0]\n";
  print "\t2: $v[1]\n";
  print "\t3: $v[2]\n";
  print "\t4: $v[3]\n";
  print "\t5: $v[4]\n";
  print "\t6: $v[5]\n";
}

Or, more compactly:

if(my @v = ($item =~ /<td>(.*)<\/td>/g))
{
  print "\t$item\n";
  print "\t$_: $v[$_-1]\n" for 1..@v;
}

Output:

1: 10:11:00
2: <a href="/page/controller/33">712</a>
3: Start
4: Finish
5: 200
6: 44

Upvotes: 5

amon

Reputation: 57640

The code behaves exactly as you told it to. This is what happens:

You matched the regex exactly once. It did match, and populated the $1 variable with the value of the first (and only!) capture group. The pattern contains no other groups, so $2 through $6 are never set. The match returns "true", and the code in the if-branch is executed.
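To make that concrete, here is a minimal sketch of a single match against a pattern with one pair of parentheses. The two-cell sample string and the @seen array are made up for illustration; they are not from the question.

```perl
use strict;
use warnings;

# Hypothetical sample: two cells, one per line.
my $item = "<td>10:11:00</td>\n<td>Start</td>\n";

my @seen;
if ($item =~ /<td>(.*)<\/td>/) {
    # The pattern has one capture group, so only $1 is populated.
    push @seen, $1;
    print "1: $1\n";
    # $2 does not correspond to any group and is undef.
    print "2: ", (defined $2 ? $2 : "(undef)"), "\n";
}
```

Without /g the match stops after the first cell, so the second cell is never looked at, no matter how many $N variables are printed.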

You want to do two things:

  1. Match with the /g modifier. This matches globally and finds every match in the string, not just the first one.
  2. Execute the regex in list context, so you can save the captures to an array.

This would lead to the following code:

if ( my @matches = ($item =~ /REGEX/g) ) {
  for my $i (1 .. @matches) {
    print "$i: $matches[$i-1]\n";
  }
}
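As a runnable version of the template above, here is a sketch with a concrete pattern filled in. The non-greedy `(.*?)` is my substitution for REGEX, and the shortened sample input and the @cells array are assumptions for illustration. The second half shows the scalar-context alternative, where /g in a while loop steps through the matches one at a time.

```perl
use strict;
use warnings;

# Shortened copy of the row from the question.
my $item = <<'EOF';
<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
EOF

# List context plus /g: returns the capture from every match.
my @matches = $item =~ /<td>(.*?)<\/td>/g;

for my $i (1 .. @matches) {
    print "$i: $matches[$i-1]\n";
}

# Scalar context plus /g: each iteration resumes where the last match ended.
my @cells;
while ($item =~ /<td>(.*?)<\/td>/g) {
    push @cells, $1;
}
```

Both forms should collect the same cells; the while loop is handy when each match needs more processing than a simple print.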

Also note that parsing HTML with regexes is evil; search CPAN for a module you like that does it for you.
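As one possible illustration, here is a sketch using HTML::TreeBuilder, which is one of several CPAN parsers and is my choice, not a recommendation from the question. It is not part of core Perl and has to be installed first. The sample row is shortened and wrapped in a complete table, which is an assumption about what the real page looks like.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # from CPAN, not bundled with Perl

# Assumed input: the row from the question inside a complete table.
my $html = <<'EOF';
<table>
<tr>
        <td>10:11:00</td>
        <td><a href="/page/controller/33">712</a></td>
        <td>Start</td>
</tr>
</table>
EOF

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element whose tag is td;
# as_text drops nested markup such as the <a> tag.
my @cells = map { $_->as_text } $tree->look_down(_tag => 'td');

$tree->delete;    # release the parse tree

print "$_: $cells[$_-1]\n" for 1 .. @cells;
```

Unlike the regex version, this yields the link text rather than the raw `<a>` element, and it does not depend on each cell sitting on its own line.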

Upvotes: 1
