Reputation: 49
I want to extract the text only under the heading Node Object Methods from a webpage. The relevant HTML is as follows:
<h2>Node Object Properties</h2>
<p>The "DOM" column indicates in which DOM Level the property was introduced.</p>
<table class="reference">
<tr>
<th width="23%" align="left">Property</th>
<th width="71%" align="left">Description</th>
<th style="text-align:center;">DOM</th>
</tr>
<tr>
<td><a href="prop_node_attributes.asp">attributes</a></td>
<td>Returns a collection of a node's attributes</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_baseuri.asp">baseURI</a></td>
<td>Returns the absolute base URI of a node</td>
<td style="text-align:center;">3</td>
</tr>
<tr>
<td><a href="prop_node_childnodes.asp">childNodes</a></td>
<td>Returns a NodeList of child nodes for a node</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_firstchild.asp">firstChild</a></td>
<td>Returns the first child of a node</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_lastchild.asp">lastChild</a></td>
<td>Returns the last child of a node</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_localname.asp">localName</a></td>
<td>Returns the local part of the name of a node</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td><a href="prop_node_namespaceuri.asp">namespaceURI</a></td>
<td>Returns the namespace URI of a node</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td><a href="prop_node_nextsibling.asp">nextSibling</a></td>
<td>Returns the next node at the same node tree level</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_nodename.asp">nodeName</a></td>
<td>Returns the name of a node, depending on its type</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_nodetype.asp">nodeType</a></td>
<td>Returns the type of a node</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_nodevalue.asp">nodeValue</a></td>
<td>Sets or returns the value of a node, depending on its
type</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_ownerdocument.asp">ownerDocument</a></td>
<td>Returns the root element (document object) for a node</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td><a href="prop_node_parentnode.asp">parentNode</a></td>
<td>Returns the parent node of a node</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_prefix.asp">prefix</a></td>
<td>Sets or returns the namespace prefix of a node</td>
<td style="text-align:center;">2</td>
</tr>
<tr>
<td><a href="prop_node_previoussibling.asp">previousSibling</a></td>
<td>Returns the previous node at the same node tree level</td>
<td style="text-align:center;">1</td>
</tr>
<tr>
<td><a href="prop_node_textcontent.asp">textContent</a></td>
<td>Sets or returns the textual content of a node and its
descendants</td>
<td style="text-align:center;">3</td>
</tr>
</table>
<h2>Node Object Methods</h2>
<p>The "DOM" column indicates in which DOM Level the method was introduced.</p>
<table class="reference">
<tr>
<th width="33%" align="left">Method</th>
<th width="61%" align="left">Description</th>
<th style="text-align:center;">DOM</th>
</tr>
<tr>
<td><a href="met_node_appendchild.asp">appendChild()</a></td>
<td>Adds a new child node, to the specified node, as the last child node</td>
<td style="text-align:center;">1 </td>
</tr>
<tr>
<td><a href="met_node_clonenode.asp">cloneNode()</a></td>
<td>Clones a node</td>
<td style="text-align:center;">1 </td>
</tr>
<tr>
<td><a href="met_node_comparedocumentposition.asp">compareDocumentPosition()</a></td>
<td>Compares the document position of two nodes</td>
<td style="text-align:center;">1 </td>
</tr>
<tr>
<td>getFeature(<span class="parameter">feature</span>,<span class="parameter">version</span>)</td>
<td>Returns a DOM object which implements the specialized APIs
of the specified feature and version</td>
<td style="text-align:center;">3 </td>
</tr>
<tr>
<td>getUserData(<span class="parameter">key</span>)</td>
<td>Returns the object associated to a key on a this node. The
object must first have been set to this node by calling setUserData with the
same key</td>
<td style="text-align:center;">3 </td>
</tr>
<tr>
<td><a href="met_node_hasattributes.asp">hasAttributes()</a></td>
<td>Returns true if a node has any attributes, otherwise it
returns false</td>
<td style="text-align:center;">2 </td>
</tr>
<tr>
<td><a href="met_node_haschildnodes.asp">hasChildNodes()</a></td>
<td>Returns true if a node has any child nodes, otherwise it
returns false</td>
<td style="text-align:center;">1 </td>
</tr>
<tr>
<td><a href="met_node_insertbefore.asp">insertBefore()</a></td>
<td>Inserts a new child node before a specified, existing, child node</td>
<td style="text-align:center;">1 </td>
</tr>
</table>
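In Perl, if I write the following:
use Web::Scraper;
my $data = scraper {
    process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
};
# $html is assumed to hold the page source shown above
my $res2 = $data->scrape($html);
for my $i (0 .. $#{$res2->{renners}}) {
    print $res2->{renners}[$i];
    print "\n";
}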
I get the text of every <a> tag in both tables, i.e.:
attributes
baseURI
.
.
.
.
insertBefore()
whereas I need the text of the <a> tags only under Node Object Methods, i.e.:
appendChild()
.
.
.
insertBefore()
In short, I want to print only the Node object methods. What should I modify in the code?
Upvotes: 1
Views: 444
Reputation: 66
You can use XPath to extract data from the first table after the heading Node Object Methods, like so:
use v5.10;  # enables say
use Web::Scraper;
my $html = do { local $/; <DATA> };
my $methods = scraper {
process '//h2[.="Node Object Methods"]/following-sibling::table[1]//tr/td[1]',
'renners[]' => 'TEXT';
};
my $res = $methods->scrape( $html );
say join "\n", @{ $res->{renners} };
The output will be
appendChild()
cloneNode()
compareDocumentPosition()
getFeature(feature,version)
getUserData(key)
hasAttributes()
hasChildNodes()
insertBefore()
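If only the linked method names are wanted (the question's selector targets `<a>` elements), one possible variation is to extend the XPath by one more step, `/a`. This is a minimal sketch, assuming a trimmed-down sample of the page; the sample markup and the key name `names` are illustrative and not taken from the original code.

```perl
use v5.10;
use strict;
use warnings;
use Web::Scraper;

# Trimmed, illustrative sample of the page from the question.
my $html = <<'HTML';
<html><body>
<h2>Node Object Properties</h2>
<table class="reference">
<tr><td><a href="prop_node_attributes.asp">attributes</a></td><td>Returns a collection of a node's attributes</td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><td><a href="met_node_appendchild.asp">appendChild()</a></td><td>Adds a new child node</td></tr>
<tr><td>getUserData(<span class="parameter">key</span>)</td><td>Returns the object associated to a key</td></tr>
<tr><td><a href="met_node_clonenode.asp">cloneNode()</a></td><td>Clones a node</td></tr>
</table>
</body></html>
HTML

# Same XPath as above, narrowed to the links inside the first cell.
my $links = scraper {
    process '//h2[.="Node Object Methods"]/following-sibling::table[1]//td[1]/a',
        'names[]' => 'TEXT';
};

my $res = $links->scrape($html);
say for @{ $res->{names} || [] };
```

With this variation, rows whose first cell has no link (such as `getUserData(key)`) are skipped, so keep the original `td[1]` form if those entries are needed.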
Upvotes: 1
Reputation: 1570
Web::Query provides an almost identical solution to the Mojo::DOM solution proposed by brian d foy.
use v5.10;  # enables say
use Web::Query;
my $html = do { local $/; <DATA> };
wq($html)
->find('table.reference:nth-of-type(2) > tr > td > a')
->each(sub {
my ($i, $e) = @_;
say $e->text();
});
However, it looks like Mojo::DOM is the more robust library. For the Web::Query selector to match correctly, I had to edit the input provided in the question and wrap all of the content in a root node.
__DATA__
<html>
...
</html>
Upvotes: 1
Reputation: 132812
Web::Scraper can use nth-of-type to choose the right table. There are two tables with the same class, so you can say table.reference:nth-of-type(2):
use v5.22;
use feature qw(postderef);
no warnings qw(experimental::postderef);
use Web::Scraper;
my $html = do { local $/; <DATA> };
my $methods = scraper {
process "table.reference:nth-of-type(2) > tr > td > a", 'renners[]' => 'TEXT';
};
my $res = $methods->scrape( $html );
say join "\n", $res->{renners}->@*;
And here's a Mojo::DOM version:
use v5.10;  # enables say
use Mojo::DOM;
my $html = do { local $/; <DATA> };
my $dom = Mojo::DOM->new( $html );
say $dom
->find( 'table.reference:nth-of-type(2) > tr > td > a' )
->map( 'text' )
->join( "\n" );
I tried looking for a selector solution that could recognize the text in the h2, but my kung fu is weak here.
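One way to key off the heading text without a pure selector is to find the h2 in Perl and then walk to the table that follows it. This is a rough sketch rather than a definitive answer: the helper name `links_under_heading` and the sample markup are made up for illustration, and it assumes a Mojolicious version whose Mojo::DOM provides `following`.

```perl
use v5.10;
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: text of the links in the first column of the
# first table that follows the <h2> whose text equals $heading.
sub links_under_heading {
    my ($html, $heading) = @_;
    my $dom = Mojo::DOM->new($html);

    # Pick the heading by comparing its text in Perl.
    my $h2 = $dom->find('h2')->first(sub { $_->text eq $heading })
        or return;

    # following() returns the later sibling elements matching the selector.
    my $table = $h2->following('table')->first
        or return;

    return @{ $table->find('td:first-child > a')->map('text')->to_array };
}

# Trimmed, illustrative sample of the page from the question.
my $sample = <<'HTML';
<h2>Node Object Properties</h2>
<table class="reference">
<tr><td><a href="prop_node_attributes.asp">attributes</a></td><td>Returns a collection of a node's attributes</td></tr>
</table>
<h2>Node Object Methods</h2>
<table class="reference">
<tr><td><a href="met_node_appendchild.asp">appendChild()</a></td><td>Adds a new child node</td></tr>
<tr><td><a href="met_node_clonenode.asp">cloneNode()</a></td><td>Clones a node</td></tr>
</table>
HTML

say for links_under_heading($sample, 'Node Object Methods');
```

Unlike `nth-of-type(2)`, this does not depend on the position of the table, so it should keep working if another table is added earlier on the page.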
Upvotes: 2