Reputation: 53551
I want to extract all the links from a page. I am using HTML::LinkExtor. How do I extract only the links that point to HTML content pages?
I also cannot extract links of this kind:
javascript:openpopup('http://www.admissions.college.harvard.edu/financial_aid/index.html')
EDIT: By HTML pages I mean content served as text/html. I am not indexing pictures etc.
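A minimal sketch of one way to keep text/html pages only (example.com is a placeholder): gather candidate links with HTML::LinkExtor, then send a HEAD request for each and check the Content-Type before keeping it.

    use strict;
    use warnings;
    use HTML::LinkExtor;
    use LWP::UserAgent;
    use URI;

    my $ua   = LWP::UserAgent->new;
    my $base = 'http://example.com/';          # placeholder URL
    my $res  = $ua->get($base);
    die $res->status_line unless $res->is_success;

    my @urls;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attrs) = @_;
        return unless $tag eq 'a' && $attrs{href};
        push @urls, URI->new_abs($attrs{href}, $base);   # resolve relative links
    });
    $extor->parse($res->decoded_content);

    # Keep only links whose server actually reports text/html.
    for my $url (@urls) {
        my $head = $ua->head($url);
        print "$url\n"
            if $head->is_success
            && $head->content_type eq 'text/html';
    }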
Upvotes: 0
Views: 1457
Reputation: 11
Perl offers a lot of ways to do this through brute force. You could use a push or pull parser to jump between tags, or you might be able to just slurp the entire page and regexp through it for links, including links inside JavaScript.
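For instance, HTML::TokeParser (from the HTML-Parser distribution) is a pull parser; a minimal sketch, with a placeholder URL, that walks the <a> tags and prints their href attributes:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use HTML::TokeParser;

    my $html = get('http://example.com/')      # placeholder URL
        or die "fetch failed";

    my $p = HTML::TokeParser->new(\$html);
    while (my $token = $p->get_tag('a')) {     # pull the next <a> start tag
        my $href = $token->[1]{href} or next;  # attribute hash is element [1]
        print "$href\n";
    }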
Have you looked at WWW::Mechanize::Plugin::JavaScript? The WWW::Mechanize module is a web bot's best friend (not that you are trying to write a bot). I've used this module before and can say it's one of the best Perl modules on CPAN.
Here is an example from its CPAN documentation, which sets the named variable to the value given:

    $m->plugin('JavaScript')->set(
        'document', 'location', 'href' => 'http://www.perl.org/');
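For the ordinary (non-JavaScript) links, WWW::Mechanize can hand you the parsed links directly; a minimal sketch, with a placeholder URL:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://example.com/');         # placeholder URL

    # links() returns WWW::Mechanize::Link objects for <a>, <area>, <frame>, etc.
    for my $link ($mech->links) {
        printf "%s => %s\n", $link->text // '', $link->url;
    }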
Upvotes: 1
Reputation: 19
I'd use WWW::Mechanize for most link gathering. Other than that, I'd do my own matching:

    my @links = $content =~ m{javascript:openpopup\('([^']+)'}g;
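Put together, a minimal sketch (the URL is a placeholder; the pattern is the one above) might look like:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $mech = WWW::Mechanize->new;
    $mech->get('http://example.com/');         # placeholder URL

    # Pull the URLs out of javascript:openpopup('...') calls by hand,
    # since an HTML parser won't look inside attribute values like these.
    my $content = $mech->content;
    my @links   = $content =~ m{javascript:openpopup\('([^']+)'}g;
    print "$_\n" for @links;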
Upvotes: 0
Reputation: 44186
Yes, HTML::LinkExtor does not understand JavaScript. In fact, it's pretty unlikely that you'll find anything that recognizes URLs embedded in JavaScript, simply because that would typically require actually running the code.
Upvotes: 2