anon

How can I extract URL and link text from HTML in Perl?

I previously asked how to do this in Groovy. However, now I'm rewriting my app in Perl because of all the CPAN libraries.

If the page contained these links:

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

The output would be:

Google, http://www.google.com
Apple, http://www.apple.com

What is the best way to do this in Perl?

Upvotes: 22

Views: 33450

Answers (11)

draegtun

Reputation: 22560

I like using pQuery for things like this...

use 5.010;    # enables say
use pQuery;

pQuery( 'http://www.perlbuzz.com' )->find( 'a' )->each(
    sub {
        say $_->innerHTML . q{, } . $_->getAttribute( 'href' );
    }
);

Also check out this previous Stack Overflow question, "Emulation of lex like functionality in Perl or Python", for similar answers.

Upvotes: 6

user13107

Reputation: 3479

HTML::LinkExtractor is better than HTML::LinkExtor.

It can give you both the link text and the URL.

Usage:

use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{If <a href="http://apple.com/"> Apple </a>};    # HTML string
my $LX = HTML::LinkExtractor->new( undef, undef, 1 );
$LX->parse( \$input );
for my $Link ( @{ $LX->links } ) {
    if ( $Link->{_TEXT} =~ m/Apple/ ) {
        print "\n LinkText $Link->{_TEXT} URL $Link->{href}\n";
    }
}

Upvotes: 3

Sherm Pendley

Reputation: 13612

Have a look at HTML::LinkExtractor and at HTML::LinkExtor, the latter being part of the HTML::Parser distribution.

HTML::LinkExtractor is similar to HTML::LinkExtor, except that besides the URL, it also gives you the link text.
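
For contrast, a minimal HTML::LinkExtor sketch (the HTML string is made up): its callback receives the tag name and the link attributes, so the URL comes through but the anchor text never does:

use strict;
use warnings;
use HTML::LinkExtor;

my $html = '<a href="http://www.google.com">Google</a>';

my $parser = HTML::LinkExtor->new( sub {
    my ( $tag, %attr ) = @_;
    return unless $tag eq 'a' && $attr{href};
    print $attr{href}, "\n";    # URL only; no link text available here
} );
$parser->parse($html);
$parser->eof;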

Upvotes: 10

Deiveegaraja Andaver

Reputation: 840

You can also use a regular expression to extract the links and their link text, though this approach is fragile compared to a real parser.

use strict;
use warnings;

local $/;            # slurp mode: read the whole DATA section at once
my $html = <DATA>;

while ( $html =~ m/<a[^>]*?href="([^>]*?)"[^>]*?>\s*([\w\W]*?)\s*<\/a>/igs )
{
    print "Link: $1 \t Text: $2\n";
}


__DATA__

<a href="http://www.google.com">Google</a>

<a href="http://www.apple.com">Apple</a>

Upvotes: -1

Aaron Graves

Reputation: 69

If you're adventurous and want to try without modules, something like this should work (adapt it to your needs):

#!/usr/bin/perl
use strict;
use warnings;

if (@ARGV < 1) {
  print "$0: Need URL argument.\n";
  exit 1;
}

# Shell out to wget for the page, then keep only lines containing anchors
my @content = split(/\n/, `wget -qO- $ARGV[0]`);
my @links   = grep(/<a.*href=.*>/, @content);

foreach my $c (@links) {
  my ($link)  = $c =~ /<a.*?href="([^"]+)".*?>/;
  my ($title) = $c =~ /<a.*?href.*?>([\s\S]+?)<\/a>/;
  print "$title, $link\n" if defined $link and defined $title;
}

There are likely a few things I did wrong here, but it works in a handful of test cases I tried after writing it (it doesn't account for things like <img> tags, etc.).

Upvotes: 6

Ashley

Reputation: 4335

Previous answers were perfectly good and I know I’m late to the party but this got bumped in the [perl] feed so…

XML::LibXML is excellent for HTML parsing and unbeatable for speed. Set the recover option when parsing badly formed HTML.

use XML::LibXML;

my $doc = XML::LibXML->load_html(IO => \*DATA);
for my $anchor ( $doc->findnodes('//a[@href]') )
{
    printf "%15s -> %s\n",
        $anchor->textContent,
        $anchor->getAttribute("href");
}

__DATA__
<html><head><title/></head><body>
<a href="http://www.google.com">Google</a>
<a href="http://www.apple.com">Apple</a>
</body></html>

–yields–

     Google -> http://www.google.com
      Apple -> http://www.apple.com
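
For messy real-world pages, a small sketch of the recover option mentioned above (per XML::LibXML's parser options, 1 recovers with warnings and 2 recovers silently):

my $doc = XML::LibXML->load_html(
    IO      => \*DATA,
    recover => 2,    # parse broken markup without complaining
);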

Upvotes: 4

Alexandr Ciornii

Reputation: 7392

Another way to do this is to query the parsed HTML with XPath. That's needed in complex cases, such as extracting all links inside a div with a specific class. Use HTML::TreeBuilder::XPath for this.

use HTML::TreeBuilder::XPath;

my $tree  = HTML::TreeBuilder::XPath->new_from_content($content);   # $content holds your HTML
my $nodes = $tree->findnodes(q{//map[@name='map1']/area});
while ( my $node = $nodes->shift ) {
    my $title = $node->attr('title');
}
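
And a sketch of the "links inside a div with a specific class" case mentioned above, reusing $tree (the class name content is made up):

# Hypothetical class name; adjust the XPath to match your markup.
for my $a ( $tree->findnodes(q{//div[@class='content']//a}) ) {
    printf "%s, %s\n", $a->as_text, $a->attr('href');
}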

Upvotes: 5

ysth

Reputation: 98398

Or consider enhancing HTML::LinkExtor to do what you want, and submitting the changes to the author.
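
As a sketch of what such an enhancement would need to do, here is one way to collect both the text and the URL with HTML::Parser directly; the inline HTML is a stand-in for your page:

use strict;
use warnings;
use HTML::Parser;

my $html = '<a href="http://www.google.com">Google</a>';
my ( @links, $href, $text );

my $p = HTML::Parser->new(
    # Remember the href when an <a> opens, accumulate text until it closes
    start_h => [ sub {
        my ( $tag, $attr ) = @_;
        ( $href, $text ) = ( $attr->{href}, '' )
            if $tag eq 'a' && defined $attr->{href};
    }, 'tagname, attr' ],
    text_h  => [ sub { $text .= shift if defined $href }, 'dtext' ],
    end_h   => [ sub {
        if ( shift eq 'a' && defined $href ) {
            push @links, [ $text, $href ];
            undef $href;
        }
    }, 'tagname' ],
);

$p->parse($html);
$p->eof;
print "$_->[0], $_->[1]\n" for @links;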

Upvotes: 3

Andy Lester

Reputation: 93666

Please look at using the WWW::Mechanize module for this. It will fetch your web pages for you, and then give you easy-to-work-with lists of URLs.

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( $some_url );
my @links = $mech->links();
for my $link ( @links ) {
    printf "%s, %s\n", $link->text, $link->url;
}

Pretty simple, and if you're looking to navigate to other URLs on that page, it's even simpler.

Mech is basically a browser in an object.
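
For example, following a link on the page Mech just fetched is a single call; the text_regex pattern below is a made-up example:

$mech->follow_link( text_regex => qr/next/i );   # click the matching link
print $mech->uri, "\n";                          # URL of the page we landed on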

Upvotes: 40

cjm

Reputation: 62099

Sherm recommended HTML::LinkExtor, which is almost what you want. Unfortunately, it can't return the text inside the <a> tag.

Andy recommended WWW::Mechanize. That's probably the best solution.

If you find that WWW::Mechanize isn't to your liking, try HTML::TreeBuilder. It will build a DOM-like tree out of the HTML, which you can then search for the links you want and extract any nearby content you want.
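
A minimal HTML::TreeBuilder sketch of that approach; the inline HTML stands in for a page you have already fetched:

use HTML::TreeBuilder;

my $html = '<a href="http://www.apple.com">Apple</a>';
my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down matches elements by tag and, here, by any non-empty href
for my $a ( $tree->look_down( _tag => 'a', href => qr/./ ) ) {
    printf "%s, %s\n", $a->as_text, $a->attr('href');
}
$tree->delete;    # HTML::Element trees hold circular refs; free them explicitly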

Upvotes: 4

converter42

Reputation: 7516

HTML is a structured markup language that has to be parsed to extract its meaning without errors. The module Sherm listed will parse the HTML and extract the links for you. Ad hoc regular expression-based solutions might be acceptable if you know that your inputs will always be formed the same way (don't forget attributes), but a parser is almost always the right answer for processing structured text.
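
For instance, a sketch of how a naive pattern breaks as soon as attribute order changes (both tags below are valid HTML; the URLs are placeholders):

# This pattern assumes href is the first attribute, so the second,
# equally valid tag slips through unmatched.
my $re = qr/<a href="([^"]+)">/;

for my $tag ( '<a href="http://example.com/">link</a>',
              '<a class="nav" href="http://example.com/">link</a>' ) {
    print $tag =~ $re ? "matched: $1\n" : "missed:  $tag\n";
}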

Upvotes: 2
