Reputation: 89
I'm extracting special links from an HTML page using WWW::Mechanize:
my $mech = WWW::Mechanize->new();
$mech->get( $uri );
my @links = $mech->find_all_links(url_regex => qr/cgi-bin/);
for my $link ( @links ) {
# try to get everything between <a href="[...]">HERE</a>
}
The links look like this:
<a href="[...]"><div><div><span>foo bar</span> I WANT THIS TEXT</div></div></a>
Calling $link->text gives me foo bar I WANT THIS TEXT, with nothing to tell me which part of it was inside the <span> element.
Is there any way to get the raw HTML code instead of the stripped text?
In other words, I need a way to get only I WANT THIS TEXT without knowing the exact text within the <span> tag.
Upvotes: 1
Views: 461
Reputation: 126722
As simbabque has said, you can't do that with WWW::Mechanize.
In fact, there's very little point in using WWW::Mechanize if you don't want any of its features. If all you're using it for is to fetch a web page, then use LWP::UserAgent instead. WWW::Mechanize is just a subclass of LWP::UserAgent with lots of additional stuff that you don't want.
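As a rough illustration of the fetch-only case, here is a minimal sketch that uses LWP::UserAgent directly. The helper name fetch_html is made up for this example, and the URL is expected on the command line; nothing is fetched unless one is given.

```perl
use strict;
use warnings;
use feature 'say';
use LWP::UserAgent;

# fetch_html is a hypothetical helper name, used only for illustration
sub fetch_html {
    my ($uri) = @_;

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($uri);
    die "GET $uri failed: ", $response->status_line, "\n"
        unless $response->is_success;

    # decoded_content undoes any Content-Encoding and decodes the charset
    return $response->decoded_content;
}

# Fetch only when a URL is passed on the command line
say length( fetch_html( $ARGV[0] ) ) if @ARGV;
```

The returned string can be handed to whichever HTML parser you prefer.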
Here's an example that uses HTML::TreeBuilder to construct a parse tree of the HTML and locate the links that you want. I've used HTML::TreeBuilder because it's pretty good at tolerating malformed HTML in a way similar to modern browsers.
I've been unable to test it as you haven't provided proper sample data and I'm not inclined to create my own.
use strict;
use warnings 'all';
use feature 'say';
use WWW::Mechanize;
use HTML::TreeBuilder;
my $mech = WWW::Mechanize->new;
$mech->get('http://www.example.com/');
my $tree = HTML::TreeBuilder->new_from_content($mech->content);
for my $link ( @{ $tree->extract_links('a') } ) {
    my ($href, $elem, $attr, $tag) = @$link;
    # Exclude non-CGI links ($link is an array reference, so test the URL in $href)
    next unless $href =~ /cgi-bin/;
    # Find all immediate child text nodes and concatenate them
    # References are non-text children
    my $text = join ' ', grep { not ref } $elem->content_list;
    next unless $text =~ /\S/;
    # Trim and consolidate spaces
    $text =~ s/\A\s+|\s+\z//g;
    $text =~ s/\s+/ /g;
    say $text;
}
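The loop above only looks at text nodes that are direct children of the <a> element. In the sample markup from the question the wanted text sits inside nested <div> elements, so a variation may be needed. The sketch below is one possible approach, not a tested solution for the real page: it clones each link element, deletes every <span> from the clone, and reads what is left. The helper name text_outside_spans and the href value /cgi-bin/demo are assumptions made for this example; the question elides the real href.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# text_outside_spans is a hypothetical helper, not part of HTML::Element.
# It returns the text of an element with all <span> descendants ignored.
sub text_outside_spans {
    my ($elem) = @_;

    # Work on a copy so the original parse tree is left untouched
    my $copy = $elem->clone;
    $_->delete for $copy->look_down( _tag => 'span' );

    # as_trimmed_text strips leading/trailing space and collapses runs
    my $text = $copy->as_trimmed_text;
    $copy->delete;
    return $text;
}

# Sample markup from the question; the href value is an assumption
my $html = '<a href="/cgi-bin/demo"><div><div>'
         . '<span>foo bar</span> I WANT THIS TEXT</div></div></a>';

my $tree = HTML::TreeBuilder->new_from_content($html);

my @texts;
for my $link ( @{ $tree->extract_links('a') } ) {
    my ( $href, $elem ) = @$link;
    next unless $href =~ /cgi-bin/;
    push @texts, text_outside_spans($elem);
}

say for @texts;
$tree->delete;
```

This drops every <span> inside the link, so it only fits if the unwanted text is always wrapped in a <span> and the wanted text never is.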
Upvotes: 3