Reputation: 71
I have an HTML file containing a 2-column table which I want to parse in order to extract pairs of strings representing the columns. The page layout of the HTML (white space, new lines) is arbitrary, hence I can't parse the file line by line.
I recall that you can parse such a thing by slurping the whole file into a string and operating on the entire string, which I'm finding a bit more challenging. I'm trying things like the following:
#!/usr/bin/perl
open(FILE, "Glossary") || die "Couldn't open file\n";
@lines = <FILE>;
close(FILE);
$data = join(' ', @lines);
while ($data =~ /<tr>.*(<td>.*<\/td>).*(<td>.*<\/td>).*<\/tr>/g) {
print $1, ":", $2, "\n";
}
which produces no output. Here's a section of the input file:
<table class="wikitable">
<tr>
<td><b>Term</b>
</td>
<td><b>Meaning</b>
</td></tr>
<tr>
<td><span id="0-Day">0-Day</span>
</td>
<td>
<p>See <a href="#Zero_Day">Zero Day</a>.
</p>
</td>
Can someone help me out?
Upvotes: 0
Views: 469
Reputation: 69224
You already have answers explaining why you shouldn't parse HTML with regexes. And you really shouldn't. But you've asked for an explanation of why your code doesn't work. So here goes...
You have two problems in your code. One stops it working and the other stops it working as you expect.
Firstly, you are using . in your regex to match any character. But . doesn't match any character. It matches any character except a newline. And you have newlines in your string. You fix that by adding the /s option to your match operator (so it has /gs instead of /g).
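To see the effect of /s in isolation, here is a minimal sketch. The $cell string is a made-up fragment modelled on the sample data, not something from your file:

```perl
use strict;
use warnings;

# Hypothetical one-cell fragment with a newline before the closing tag.
my $cell = "<td>0-Day\n</td>";

# Without /s, . cannot cross the newline, so the pattern never reaches </td>.
my $without_s = $cell =~ /<td>.*<\/td>/  ? 1 : 0;

# With /s, . matches the newline as well, so the pattern succeeds.
my $with_s    = $cell =~ /<td>.*<\/td>/s ? 1 : 0;

print "without /s: $without_s\n";   # prints "without /s: 0"
print "with /s: $with_s\n";         # prints "with /s: 1"
```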
With that fix in place, you get a result from your code. Using your test data, I see:
<td><b>Term</b>
</td>:<td><b>Meaning</b>
</td>
Which is correct. But looking at your test data, I wondered why I wasn't getting two results, given the /g. I soon realised it was because your test data is missing the closing </tr> on the second row. When I added that, I got this result:
<td><span id="0-Day">0-Day</span>
</td>:<td>
<p>See <a href="#Zero_Day">Zero Day</a>.
</p>
</td>
Ok. It's now finding the second result. But what has happened to the first one? That's the second error in your code.
You have .* a few times in your regex. That means "zero or more of any character". But it's the "or more" that is a problem here. By default, Perl regex quantifiers (* or +) are greedy. That means they will use up as much of the string as possible. And the first .* in your regex is eating up a lot of your string. All of it up to the second <tr>, in fact.
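Greediness is easier to see on a single line. This is only an illustrative sketch: $row is a made-up string, and the second match uses the ? form of the quantifier purely for contrast:

```perl
use strict;
use warnings;

# Hypothetical single-line row with two cells.
my $row = "<td>Term</td><td>Meaning</td>";

# Greedy: .* runs to the LAST </td>, swallowing the boundary between the cells.
my ($greedy) = $row =~ /<td>(.*)<\/td>/;

# Non-greedy: .*? stops at the FIRST </td>.
my ($lazy)   = $row =~ /<td>(.*?)<\/td>/;

print "greedy: $greedy\n";   # prints "greedy: Term</td><td>Meaning"
print "lazy:   $lazy\n";     # prints "lazy:   Term"
```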
The solution to that is to make the .* non-greedy. And you do that by adding ? to the end. So you can replace all of the .* with .*?. Having done that, I get this output:
<td><b>Term</b>
</td>:<td><b>Meaning</b>
</td>
<td><span id="0-Day">0-Day</span>
</td>:<td>
<p>See <a href="#Zero_Day">Zero Day</a>.
</p>
</td>
Which seems correct to me.
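For reference, here is roughly what the loop looks like with both changes applied (/s added and every .* turned into .*?). It is a sketch, not a drop-in replacement: the here-doc stands in for your slurped Glossary file, I have assumed the second row is closed with </tr> and the table with </table>, and the @pairs array is my own addition so the results can be inspected afterwards.

```perl
use strict;
use warnings;

# Stand-in for the contents of "Glossary" (assumption: rows properly closed).
my $data = <<'HTML';
<table class="wikitable">
<tr>
<td><b>Term</b>
</td>
<td><b>Meaning</b>
</td></tr>
<tr>
<td><span id="0-Day">0-Day</span>
</td>
<td>
<p>See <a href="#Zero_Day">Zero Day</a>.
</p>
</td></tr>
</table>
HTML

# /s lets . match newlines; .*? keeps each match inside a single row.
my @pairs;
while ($data =~ /<tr>.*?(<td>.*?<\/td>).*?(<td>.*?<\/td>).*?<\/tr>/gs) {
    push @pairs, [$1, $2];
    print $1, ":", $2, "\n";
}
```

This still breaks as soon as the markup changes shape (attributes on <td>, nested tables, extra columns), which is why a real HTML parser remains the better option.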
So, to summarise:
. doesn't match newlines. To do that, you need /s.
.* is greedy by default. To make it non-greedy, use .*?.
Upvotes: 1
Reputation: 13792
There is an HTML::TableExtract module on CPAN, which simplifies the problem you are trying to solve:
use strict;
use warnings;
use HTML::TableExtract qw(tree);
# headers expects an array reference, so wrap the qw() list in [ ]
my $te = HTML::TableExtract->new( headers => [qw(Term Meaning)] );
my $html_file = "Glossary";
$te->parse_file($html_file);
my $table = $te->first_table_found;
# ...
Upvotes: 4