Reputation: 797
I need to parse monitoring data from an HTML table for logging purposes.
There are multiple tables in the HTML document without any identifiers, so identifying the correct TR requires improvising.
The row of interest is:
<TR>
<TD>Signal to Noise Ratio</TD>
<TD>35 dB</TD>
<TD>35 dB</TD>
<!-- MORE TDs continue here... -->
</TR>
The string "Signal to Noise Ratio" in the first TD is therefore the only constant I can use to find the right TR and the TDs of interest in the document.
The number of TD elements that follow the identifying cell in this row is variable. I need to store the integers from all of those elements as variables, similar to this:
use strict;
use warnings;
use LWP::Simple;

my %data;
my @keys = qw(SNR1 SNR2 SNR3 SNR4);
my $content = LWP::Simple::get("http://192.168.100.1/cmSignalData.htm")
    or die "Couldn't get it!";
# Only captures the first "NN dB" cell found anywhere in the page
if ( $content =~ /<TD>(.+?) dB<\/TD>/ ) {
    $data{SNR1} = $1;
}
for (@keys) {
    print "$_:" . $data{$_} . " ";
}
print "\n";
I then need to parse other TR elements in other tables in exactly the same way.
Upvotes: 2
Views: 458
Reputation: 89574
You can easily get the values you want with an XPath query, since you are looking for all the following td nodes at the same level after a specific td node.
Here's an example using the HTML::TreeBuilder::XPath module:
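use strict;
use warnings;
use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("yourfile.html");
# Text of every td that follows the label cell in the same row
my @snr = $tree->findvalues('//td[.="Signal to Noise Ratio"]/following-sibling::td');
$tree->delete;    # free the tree once the values are extracted
@snr = map /^(\d+)/, @snr;    # keep only the leading integer of each cell
print join(', ', @snr), "\n";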
XPath is a language to query the tree representation of an HTML/XML document, the DOM (Document Object Model) tree.
Query details:
// # anywhere in the tree (*)
td # a `td` element with the following "predicate" (embedded in square brackets):
[.="Signal to Noise Ratio"] # predicate: the text content of the current node (represented
# by a dot) is exactly "Signal to Noise Ratio"
/following-sibling::td # 'following-sibling::' is a kind of selector called "axis"
# that selects all nodes with the same parent node after the
# current element.
# 'td' selects only `td` elements in this node-set.
(*) If you want, you can be more explicit. Instead of using //td, you can describe the full path from the root element: /html/body/center/table/tbody/tr/td
This approach has to build the document tree before it can query it. It is not a fast approach, but its main advantage is that you rely on the HTML structure instead of matching against raw text.
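Note that XPath also has several string functions, including substring-before, which could replace the array map that extracts the digits at the beginning of each item:
//td[.="Signal to Noise Ratio"]/following-sibling::td/substring-before(text(), " dB")
A function call used as the last step of a path is XPath 2.0 syntax, so an XPath 1.0 engine may reject this query; in that case keep the map.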
If performance is important, you can try another approach with a pull parser like HTML::TokeParser::Simple. It is less convenient to write, but it is faster because there is no DOM tree to build, and it saves memory because you can read the HTML file as a stream and stop reading whenever you want, without loading the whole file into memory.
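As a rough illustration of the pull-parser idea, here is a minimal sketch using HTML::TokeParser::Simple. The inline sample markup (including the "Downstream Power" row) and the $in_row flag are assumptions made up for the example; for the real page you would pass file => or handle => to the constructor instead of string =>.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample markup standing in for the real page (assumption for illustration)
my $html = <<'EOF';
<table>
<TR><TD>Downstream Power</TD><TD>3 dBmV</TD></TR>
<TR><TD>Signal to Noise Ratio</TD><TD>35 dB</TD><TD>34 dB</TD></TR>
</table>
EOF

my $parser = HTML::TokeParser::Simple->new( string => $html );

my @snr;
my $in_row = 0;    # becomes true once the label cell has been seen

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('td') ) {
        # Read the cell text up to its closing tag
        my $text = $parser->get_trimmed_text('/td');
        if ($in_row) {
            push @snr, $1 if $text =~ /^(\d+)/;
        }
        elsif ( $text eq 'Signal to Noise Ratio' ) {
            $in_row = 1;
        }
    }
    elsif ( $in_row && $token->is_end_tag('tr') ) {
        last;    # the row is finished, so stop reading the document
    }
}

print join( ', ', @snr ), "\n";
```

The loop never builds a tree: it only tracks whether it is inside the wanted row and leaves as soon as that row closes, which is where the speed and memory savings come from.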
Upvotes: 2
Reputation: 126742
Here's a version using Mojolicious. It pulls the HTML directly from your pastebin repository.
The for loop iterates over all the rows in all tables. Inside it, the @columns array is set to the text content of all the columns (<td> elements) in the row.
The first element is checked, first that it exists and second that it is equal to Signal to Noise Ratio. If so, the file-level array @snr is set to the decimal numbers in the remainder of @columns, and last stops the search for the required row.
use strict;
use warnings;
use 5.010;
use Mojo::UserAgent;    # load the user agent class explicitly

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get('http://pastebin.com/raw.php?i=73H5peKW')->res->dom;
my @snr;
for my $row ( $dom->find('table tr')->each ) {
    my @columns = $row->find('td')->map('text')->each;
    next unless @columns;    # skip rows without td cells
    if ( shift @columns eq 'Signal to Noise Ratio' ) {
        @snr = map /(\d+)/, @columns;
        last;
    }
}
say "@snr";
35 35 34 34 34 34 34 34
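To get the SNR1, SNR2, ... keys used in the question, the list can be turned into a hash with a slice. This is only a sketch: the @snr values below are the first four numbers from the output above, hard-coded for illustration, and the key names follow the question's naming.

```perl
use strict;
use warnings;

# Values as extracted from the row (hard-coded sample for illustration)
my @snr = ( 35, 35, 34, 34 );

# Build SNR1, SNR2, ... keys to match however many cells the row had
my %data;
@data{ map { "SNR$_" } 1 .. scalar @snr } = @snr;

# Print in numeric key order; a plain string sort would put SNR10 before SNR2
for my $key ( sort { substr( $a, 3 ) <=> substr( $b, 3 ) } keys %data ) {
    print "$key:$data{$key} ";
}
print "\n";
```

Because the keys are generated from the length of @snr, nothing needs to change when the row has more or fewer cells.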
Upvotes: 2
Reputation: 21676
Do not parse HTML with a regex. Use an HTML parser.
There are several HTML parser modules available on CPAN. My favorite is Mojo::DOM. You can use it like below:
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
my $HTML = <<"EOF";
<table>
<TR>
<TD>Signal to Noise Ratio</TD>
<TD>35 dB</TD>
<TD>35 dB</TD>
</TR>
</table>
EOF
my $dom = Mojo::DOM->new( $HTML );
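# Check each row, so the match is limited to the cells of the wanted row
for my $row ($dom->find('tr')->each) {
    my $first = $row->at('td') or next;
    next unless $first->text eq 'Signal to Noise Ratio';
    for my $e ($row->find('td')->each) {
        if ($e->text =~ /(\d+)\sdB/) {
            print $1 . "\n";
        }
    }
}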
For an 8-minute video tutorial on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.
Upvotes: 2