MikeEMKI

Reputation: 47

Extracting Links in Perl using TreeBuilder

I'm working on a script to extract a bunch of information into one HTML file. I'm having some difficulty extracting ONLY a specific set of links from the page in question, however.

Here is a rough structure of the site. There are some other headings and paragraphs in between the innercontent div and what I'm showing below.

<div id="innercontent">
<h1>Download here</h1>
<a href="website.pdf"><img src="stuff"></a>
</div>

Now there are multiple links found in the div ID "innercontent," so I'm looking to find a way to either match a string or otherwise to only get the links I want. Keep in mind all of the links I'm looking to grab will be .pdf, so perhaps that can be of some help. I'm pretty sure TreeBuilder can handle this based on the research I've done. Here are two methods I'm trying. I'd prefer to solve it using the first.

# link to pdf of transcript
for ( $mech->look_down(_tag => 'a') ) {
  next unless $_->as_trimmed_text =~ m/pdf/;
  say $_->as_HTML;
}

my @links = $mech->links();
for my $link ( @links ) {
  print $link->url;
}

I realize the latter method is just going to search the entire page for links, but I'm including it just in case that method is more efficient, or if both of these methods can be combined.

Any help or advice would be greatly appreciated!

Upvotes: 1

Views: 706

Answers (2)

Trenton Trama

Reputation: 4930

WWW::Mechanize has the ability to extract links based on quite a few attributes, such as the text that's displayed for the link, the actual link, or id.

For your specific example, you'd fetch the links that are pdfs with:

my @links = $mech->find_all_links(url_regex => qr/\.pdf$/);

and then do whatever you needed with the resulting array.

See the WWW::Mechanize documentation for find_all_links; the entry for find_link lists the available options.
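As a minimal sketch of "doing whatever you need" with the result, the snippet below wraps the call in a small helper and prints each matching URL. The helper name pdf_urls, the case-insensitive /i flag, and taking the page address from the command line are assumptions made for illustration, not part of the original answer. Note that find_all_links looks at the whole page, so this does not restrict the search to the innercontent div.

```perl
use strict;
use warnings;

use WWW::Mechanize;

# Hypothetical helper: return the href of every link on the current page
# whose URL ends in .pdf. $mech must already have fetched a page with get().
sub pdf_urls {
    my ($mech) = @_;
    return map { $_->url } $mech->find_all_links( url_regex => qr/\.pdf$/i );
}

# Usage: pass the page address as the first command-line argument.
if ( my $url = shift @ARGV ) {
    my $mech = WWW::Mechanize->new;
    $mech->get($url);
    print "$_\n" for pdf_urls($mech);
}
```

Each element returned by find_all_links is a WWW::Mechanize::Link object; url gives the address as written in the page, while url_abs gives the absolute form if that is what you need for downloading.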

Upvotes: 1

Borodin

Reputation: 126722

Using HTML::TreeBuilder, you have to make two successive calls to look_down: the first to find div elements with an id attribute of innercontent, and the second to search within those elements for a elements with an href attribute whose value ends with .pdf.

It looks like this:

use strict;
use warnings;

use HTML::TreeBuilder;

my $html = <<END;
<div id="innercontent">
<h1>Download here</h1>
<a href="website.pdf"><img src="stuff"></a>
</div>
END

my $tree = HTML::TreeBuilder->new_from_content($html);

for my $div ( $tree->look_down(_tag => 'div', id => 'innercontent') ) {
    my @anchors = $div->look_down(_tag => 'a', href => qr/\.pdf\z/i );
    print $_->attr('href'), "\n" for @anchors;
}

Output:

website.pdf
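The same two-step search can be packaged as a function that takes the HTML as a string, which makes it easier to reuse on a fetched page. This is only a sketch: the function name innercontent_pdf_hrefs is hypothetical, and the explicit call to delete is a conservative choice for freeing the tree.

```perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Hypothetical helper: return the href of each .pdf link found inside
# <div id="innercontent"> in the given HTML string.
sub innercontent_pdf_hrefs {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @hrefs;
    for my $div ( $tree->look_down( _tag => 'div', id => 'innercontent' ) ) {
        # Second look_down is scoped to this div only.
        push @hrefs, map { $_->attr('href') }
            $div->look_down( _tag => 'a', href => qr/\.pdf\z/i );
    }

    $tree->delete;    # free the parse tree explicitly
    return @hrefs;
}
```

If the page is fetched with WWW::Mechanize, its content method returns the HTML, so the call would be innercontent_pdf_hrefs( $mech->content ).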

I like Mojo::DOM for this, as it allows simple CSS selectors and lets problems be solved very concisely.

Here is a solution using that module. The output is identical to that of the solution above.

use strict;
use warnings;

use Mojo::DOM;

my $html = <<END;
<div id="innercontent">
<h1>Download here</h1>
<a href="website.pdf"><img src="stuff"></a>
</div>
END

my $dom = Mojo::DOM->new($html);

for my $anchor ( $dom->find('div#innercontent a[href]')->each ) {
    my $href = $anchor->attr('href');
    print "$href\n" if $href =~ /\.pdf\z/i;
}
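As a possible variation, the suffix test can be moved into the selector itself with the CSS attribute selector [href$=".pdf"]. This is a sketch rather than part of the original answer: the function name innercontent_pdf_hrefs_css is hypothetical, and unlike the regex above this form of the selector is case-sensitive, so it would miss a link ending in .PDF.

```perl
use strict;
use warnings;

use Mojo::DOM;

# Hypothetical helper: return the href of each link inside
# <div id="innercontent"> whose href ends in ".pdf" (case-sensitive).
sub innercontent_pdf_hrefs_css {
    my ($html) = @_;
    my $hrefs = Mojo::DOM->new($html)
        ->find('div#innercontent a[href$=".pdf"]')    # suffix match in CSS
        ->map( attr => 'href' );                       # call attr('href') on each
    return @{ $hrefs->to_array };
}
```

Keeping the regex check from the loop above is the safer choice when the case of the file extension may vary.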

Upvotes: 0
