Reputation: 74

Regex matches but capture group remains uninitialized

I am new to Perl and regex. I think I understand the idea and how to use regexes, but I got stuck on a problem while writing a script. I have the content of a web page and I am trying to extract some information from it.

my @rows = split(/<tr(\s)bgcolor=.{8}/,$content);

foreach my $row (@rows) {
    if ( $row =~ /<td\s+nowrap\s+align=.*\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/ig ) {
        print $1;
        print $file_opt $row . "\n";

        # there will be more code later on
    }
}

This gives me a warning that $1 is uninitialized. I understand that this happens when the pattern does not match the string. But I have the regex inside an if, so if execution enters the if, the pattern did match, right? As you can see, I printed the rows to a file. Each one looks like this:

<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>

None of the unnecessary parts of $content end up in the file. So does this pattern match or not?

Upvotes: 0

Answers (2)

DavidO

Reputation: 13942

From the code in your post, it looks like you are trying to capture the bgcolor attribute for each table cell in a given row. Not all of the cells have a bgcolor set, but some of them do. Here's how you can extract that information using HTML::TreeBuilder:

use HTML::TreeBuilder 5 -weak;

my $html = q{<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>};

my $t = HTML::TreeBuilder->new_from_content($html);

foreach my $col ( $t->look_down('_tag','tr')->content_list ) {
  print $col->attr('bgcolor'), "\n" if defined $col->attr('bgcolor');
}

I'm sure you need to retrieve more than that, but it's all we are able to determine given the vague description and incomplete code of your question.

But the point is solid; don't parse HTML with regexes, parse HTML with an HTML parser. It's a slightly steeper learning curve at the beginning, but the result will be more robust, easier to maintain, and the skill you learn will be applicable to any HTML document, not just this particular one.

HTML::TreeBuilder comes with some good documentation, but you've got to read a good portion of it to make sense of the whole thing.

There's another HTML parsing module, Mojo::DOM, which comes with the Mojolicious framework. Personally, I find it easier to use, but sometimes when I post examples people seem to jump to the conclusion that they have to load some heavy-weight web framework to use it (which isn't entirely true, but I'm tired of swimming upstream. ;). You might want to have a look at it and see if it better fits your taste. Here's an example:

use Mojo::DOM;

my $html = q{<td nowrap align="right">DOLNOŚLĄSKIE</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">0</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">4</td><td nowrap align="right" bgcolor=#D0E0D0 >0</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >1</td><td nowrap align="right">3</td><td nowrap align="right" bgcolor=#D0E0D0 >6</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >2</td><td nowrap align="right">1</td><td nowrap align="right" bgcolor=#D0E0D0 >19</td><td nowrap align="right">0</td></tr>};

for my $td ( Mojo::DOM->new($html)->find('td[bgcolor]')->each ) {
  print $td->attr('bgcolor'), "\n";
}

Both of those code examples will produce the following output:

#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0
#D0E0D0

...which probably isn't terribly useful, but is exactly what the code you posted seems to want to capture. At least it's a starting point that you should be able to adapt to your own needs.

I believe the documentation for Mojo::DOM is more approachable, which might just make the difference, especially if you're new to Perl. My recommendation would be to start there, and build your solution around that module. In the long run you'll be much better off than tearing your hair out using regexes to extract data from HTML.

The Mojolicious distribution installs in under a minute on most systems, and includes the Mojo::DOM module, which on its own is quite light-weight. It's a good option.

Upvotes: 4

DeVadder

Reputation: 1404

Do not handcraft regexes to parse HTML, yadda yadda, now to your actual question:

"But i have regex under if - so if it enters the if, it does match, right?"

In your regex you have a ? quantifier after your capture group. That means the regex can match (and on your example does match) with the capture group participating either once or not at all. If the match the engine finds happens to use the capture group zero times, nothing is captured and $1 stays undefined, even though the match as a whole succeeded. Remove that question mark to make sure your regex only matches when it actually captured something.

With the question mark removed, the regex works on your example and does capture something.
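A minimal sketch of that difference, assuming a shortened two-cell row modelled on the sample in the question (the variable names are made up for illustration):

```perl
use strict;
use warnings;

# Shortened, hypothetical two-cell row modelled on the sample in the question.
my $row = '<td nowrap align="right">A</td>'
        . '<td nowrap align="right" bgcolor=#D0E0D0 >0</td>';

# Pattern from the question: the capture group is optional.
my $optional = qr/<td\s+nowrap\s+align=.*\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/i;

# Same pattern with the "?" after the capture group removed.
my $required = qr/<td\s+nowrap\s+align=.*\s?(bgcolor=.*\s+)>\w*\s?<\/td>/i;

my ( $opt_matched, $opt_capture ) = ( 0, undef );
if ( $row =~ $optional ) {
    $opt_matched = 1;
    $opt_capture = $1;    # undef when the group took no part in the match
}

my ( $req_matched, $req_capture ) = ( 0, undef );
if ( $row =~ $required ) {
    $req_matched = 1;
    $req_capture = $1;    # always defined here, the group is mandatory
}

print "optional group: matched=$opt_matched, capture is ",
    ( defined($opt_capture) ? "'$opt_capture'" : 'undef' ), "\n";
print "required group: matched=$req_matched, capture is ",
    ( defined($req_capture) ? "'$req_capture'" : 'undef' ), "\n";
```

Both patterns match this row, but only the second one leaves something in $1. If the group has to stay optional, checking defined($1) before using it avoids the warning.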

One might assume that, because the ? quantifier is greedy, the group will always capture something when it can (and indeed it suddenly does once the quantifier is removed). But there are several greedy quantifiers in that pattern, and an earlier one, the first .*, gets to be greedy first: it consumes the bgcolor attribute before the optional group is ever tried.
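One way to see which quantifier wins is to wrap the first .* in a capture group of its own. This is a sketch for illustration only: the extra group is not part of the original pattern, the row is a shortened, made-up version of the sample, and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Shortened, hypothetical two-cell row modelled on the sample in the question.
my $row = '<td nowrap align="right">A</td>'
        . '<td nowrap align="right" bgcolor=#D0E0D0 >0</td>';

# The question's pattern with one change made purely for illustration:
# the first ".*" is wrapped in parentheses so we can inspect what it consumed.
# $1 is now the text taken by ".*"; $2 is the original optional bgcolor group.
my ( $swallowed, $bgcolor );
if ( $row =~ /<td\s+nowrap\s+align=(.*)\s?(bgcolor=.*\s+)?>\w*\s?<\/td>/i ) {
    ( $swallowed, $bgcolor ) = ( $1, $2 );
}

print "first .* consumed: ",
    ( defined($swallowed) ? "'$swallowed'" : 'undef' ), "\n";
print "bgcolor group:     ",
    ( defined($bgcolor) ? "'$bgcolor'" : 'undef' ), "\n";
```

On this row the text taken by the first .* runs past the end of the first cell and includes the bgcolor attribute of the second one, so the optional group has nothing left to match and stays undefined.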

Upvotes: 2
