DominiqueBal

Reputation: 797

Parse HTML tables for varying number of columns

I need to parse monitoring data from an HTML table for logging purposes.

There are multiple tables in the HTML document without any identifiers, so identifying the correct TR requires improvising.

The particular row of interest is:

<TR>
    <TD>Signal to Noise Ratio</TD>
    <TD>35 dB</TD>
    <TD>35 dB</TD>
    <!-- MORE TDs continue here... -->
</TR>

Thus, the only constant I can use as an identifier is the string "Signal to Noise Ratio" in the first TD of the row, which marks the TR whose remaining TDs I am interested in.

The number of TD elements following the first that contains the identifying string in this row is variable. I need to store all integers from those elements as variables, similar to this:

use LWP::Simple;

my %data;
my @keys = qw(SNR1 SNR2 SNR3 SNR4);

my $content = LWP::Simple::get("http://192.168.100.1/cmSignalData.htm")
    or die "Couldn't get it!";

if ( $content =~ /<TD>(.+?) dB<\/TD>/ ) {
    $data{SNR1} = $1;
} 

for (@keys) {
    print "$_:" . $data{$_} . " ";
}
print "\n";

I then want to parse other TR elements in other tables following exactly the same pattern.

Upvotes: 2

Views: 458

Answers (3)

Casimir et Hippolyte

Reputation: 89574

You can easily get the values you want with an XPath query, since you are looking for all the following td nodes at the same level after a specific td node.

Here's an example using the HTML::TreeBuilder::XPath module:

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file("yourfile.html");

my @snr = $tree->findvalues('//td[.="Signal to Noise Ratio"]/following-sibling::td');
$tree->delete;

@snr = map /^(\d+)/, @snr;
print join(', ', @snr);

XPath is a language to query the tree representation of an HTML/XML document, the DOM (Document Object Model) tree.

Query details:

//   # anywhere in the tree (*)
td   # a `td` element with the following "predicate" (embedded in square brackets):

[.="Signal to Noise Ratio"] # predicate: the text content of the current node (represented
                            # by a dot) is exactly "Signal to Noise Ratio"

/following-sibling::td # 'following-sibling::' is a kind of selector called "axis"
                       # that selects all nodes with the same parent node after the
                       # current element.
                       # 'td' selects only `td` elements in this node-set.

(*) If you want, you can be more explicit: instead of using //td, you can describe the full path from the root element: /html/body/center/table/tbody/tr/td

This approach has to build the whole document tree before it can query it. It is not a fast approach, but its main advantage is that you rely on the HTML structure instead of ad hoc text matching.

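Note that you could avoid the map that extracts the leading digits from each item, because XPath has several string functions, including substring-before. Using a function call as the last step of a path, as in the expression below, requires XPath 2.0, so it may not be accepted by an engine limited to XPath 1.0: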

//td[.="Signal to Noise Ratio"]/following-sibling::td/substring-before(text(), " dB")

If performance is important, you can try another approach with a pull parser like HTML::TokeParser::Simple. It is less convenient to write, but it is faster because there is no DOM tree to build, and it saves memory because you can read the HTML file as a stream and stop reading whenever you want, without loading the whole file into memory.
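
The pull-parser idea can be sketched as follows. This is only a minimal sketch, not a tuned implementation: the inline HTML (including the second row) is made-up sample data standing in for the real page, and the $in_row flag is a name introduced here for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Made-up sample input; the real page would be read from a file or a download
my $html = <<'EOF';
<table>
<TR><TD>Signal to Noise Ratio</TD><TD>35 dB</TD><TD>34 dB</TD></TR>
<TR><TD>Power Level</TD><TD>3 dBmV</TD></TR>
</table>
EOF

my $parser = HTML::TokeParser::Simple->new( string => $html );

my @snr;
my $in_row = 0;    # true once the identifying cell has been seen

while ( my $token = $parser->get_token ) {
    if ( $token->is_start_tag('td') ) {
        # Read the cell text up to (but not including) the closing </td>
        my $text = $parser->get_trimmed_text('/td');
        if ( $text eq 'Signal to Noise Ratio' ) {
            $in_row = 1;
        }
        elsif ( $in_row && $text =~ /^(\d+)/ ) {
            push @snr, $1;
        }
    }
    elsif ( $in_row && $token->is_end_tag('tr') ) {
        last;      # the row of interest is finished; stop reading
    }
}

print join( ', ', @snr ), "\n";
```

Because the loop leaves at the closing tr of the matching row, nothing after that row is tokenized.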

Upvotes: 2

Borodin

Reputation: 126742

Here's a version using Mojolicious. It pulls the HTML directly from your pastebin link.

The for loop iterates over all the rows in all tables. Inside it, the @columns array is set to the text content of all the columns (<td> elements) in the row.

The first element is checked, firstly that it exists, and secondly that it is equal to Signal to Noise Ratio. If so, the array @snr (declared outside the loop) is set to the decimal numbers in the remainder of @columns, and last stops the search for the required row.

use strict;
use warnings;
use 5.010;

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new;

my $dom = $ua->get('http://pastebin.com/raw.php?i=73H5peKW')->res->dom;

my @snr;

for my $row ( $dom->find('table tr')->each ) {
    my @columns = $row->find('td')->map('text')->each;
    next unless @columns;
    if ( shift @columns eq 'Signal to Noise Ratio' ) {
        @snr = map /(\d+)/, @columns;
        last;
    }
}

say "@snr";

output

35 35 34 34 34 34 34 34
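
To get from the @snr list to the %data layout sketched in the question, a hash slice can generate one key per value, however many columns the row has. A minimal sketch, assuming the SNR1, SNR2, ... key naming from the question and using hard-coded sample values in place of the parsed ones:

```perl
use strict;
use warnings;

# Sample values standing in for the parsed @snr list (assumption for illustration)
my @snr = ( 35, 35, 34, 34 );

# Build one key per value: SNR1 .. SNRn
my @keys = map { "SNR$_" } 1 .. @snr;

my %data;
@data{@keys} = @snr;    # hash slice: pairs each key with its value

print join( ' ', map { "$_:$data{$_}" } @keys ), "\n";
```

The same pattern works for the other rows by swapping the key prefix.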

Upvotes: 2

Chankey Pathak

Reputation: 21676

Do not parse HTML with regular expressions. Use an HTML parser.

There are several HTML parser modules available on CPAN. My favorite is Mojo::DOM. You can use it as shown below:

#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;

my $HTML = <<"EOF";
<table>
<TR>
<TD>Signal to Noise Ratio</TD>
<TD>35 dB</TD>
<TD>35 dB</TD>
</TR>
</table>
EOF

my $dom = Mojo::DOM->new( $HTML );

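# Check each row separately so that other rows and tables are ignored
for my $row ($dom->find('tr')->each) {
   my $first = $row->at('td') or next;    # skip rows without any cells
   next unless $first->text eq 'Signal to Noise Ratio';
   for my $e ($row->find('td')->each) {
      if ($e->text =~ /(\d+)\sdB/) {
          print "$1\n";
      }
   }
}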

For an 8-minute video tutorial on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.

Upvotes: 2
