MikeEMKI

Reputation: 47

Using WWW::Mechanize to scrape multiple pages under a directory - Perl

I'm working on a project to scrape every interview found here into an HTML-ready document, which will later be dumped into a DB that automatically updates our website with the latest content. You can see an example of my current scraping script, which I asked a question about the other day: WWW::Mechanize Extraction Help - PERL

The problem I can't wrap my head around is whether what I'm trying to accomplish is even possible. Because I don't want to have to guess when a new interview is published, my hope is to scrape the page that has a directory listing of all of the interviews and have my program automatically fetch the content at any new URL (new interview).

Again, the site in question is here (scroll down to see the listing of interviews): http://millercenter.org/president/clinton/oralhistory

My initial thought was to put a regex of .\ at the end of the link above, in the hope that it would automatically search any links found under that page. I can't seem to get this to work using WWW::Mechanize, however. I will post what I have below, and if anyone has any guidance or experience with this, your feedback would be greatly appreciated. I'll also summarize my tasks below the code so that you have a concise understanding of what we hope to accomplish.

Thanks to any and all that can help!

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::Link;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/\.");

# find all <dl> tags
my @list = $mech->find('dl');

foreach ( @list ) {
print $_->as_HTML();
}

# # find all links
# my @links = $mech->links();
# foreach my $link (@links) {
#     print "$link->url \n";
# }

To summarize what I'm hoping is possible:

Upvotes: 0

Views: 666

Answers (2)

MarcoS

Reputation: 17711

No, you can't use wildcards in URLs... :-(

You'll have to parse the listing page yourself, and then get and process the pages in a loop. Extracting specific fields from a page's contents will be a straightforward task with WWW::Mechanize...

UPDATE: answering the OP's comment:

Try this logic:

use strict;
use warnings;
use WWW::Mechanize;
use LWP::Simple;
use File::Basename;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get("http://millercenter.org/president/clinton/oralhistory");

# find all links to interview pages under the listing page
my @list = $mech->find_all_links( url_regex => qr{/president/clinton/oralhistory/} );

foreach my $link (@list) {
  my $url       = $link->url_abs->as_string;
  my $localfile = basename($url);
  my $localpath = "./$localfile";

  print "$localfile\n";
  getstore($url, $localpath);
}

Upvotes: 1

simbabque

Reputation: 54323

My answer is focused on the approach of how to do this. I'm not providing code.

There are no IDs in the links, but the names of the interview pages seem to be fine to use. You need to parse them out and build a lookup table.

Basically, you start by building a parser that collects all the links that look like interviews. That is fairly simple with WWW::Mechanize. The listing page URL is:

http://millercenter.org/president/clinton/oralhistory

All the interviews follow this schema:

http://millercenter.org/president/clinton/oralhistory/george-mitchell

So you can find all links on that page that start with http://millercenter.org/president/clinton/oralhistory/. Then you make them unique, because there is a teaser box slider that showcases some of them and links to the same pages with a 'read more' link. Use a hash to do that, like this:

my %seen;
foreach my $url (@urls) {
  $mech->get($url) unless $seen{$url};
  $seen{$url}++;
}
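
For reference, the @urls list used above can be collected with find_all_links and a url_regex filter. This is only a sketch of that step (not code from the question or this answer), assuming all interview links live under the /oralhistory/ path:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://millercenter.org/president/clinton/oralhistory');

# absolute URLs of everything linked under the oralhistory path;
# the %seen hash above then weeds out the duplicates from the teaser slider
my @urls = map { $_->url_abs->as_string }
           $mech->find_all_links( url_regex => qr{/president/clinton/oralhistory/} );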

Then you fetch each page, do your processing, and write it to your database. Use the URL or the interview-name part of the URL (e.g. george-mitchell) as the primary key. If there are other presidents and you want those as well, adapt this in case the same name shows up for several presidents.
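
As a small illustration of that key choice (just a sketch; the example URL is one of the interviews listed on the page), the name part can be captured with a regex:

my $url  = 'http://millercenter.org/president/clinton/oralhistory/george-mitchell';
my ($id) = $url =~ m{/oralhistory/([^/]+)$};   # $id is now "george-mitchell"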

Then you go back and add a cache lookup to your code. You grab all the IDs from the DB before you start fetching the pages, and put those in a hash.

# prepare query and stuff...
my %cache;
while (my $res = $sth->fetchrow_hashref) {
  $cache{ $res->{id} }++;
}

# later...
foreach my $url (@urls) {
  my ($id) = $url =~ m{/([^/]+)$};   # grab the ID out of the URL
  next if $cache{$id};               # already in the DB
  next if $seen{$url};               # already fetched this run

  $mech->get($url);
  $seen{$url}++;
}
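
The "# prepare query and stuff..." part is left open above. If the IDs were stored in, say, an SQLite table (the table and column names here are invented purely for illustration), that step could look like the sketch below, after which the fetchrow_hashref loop above fills %cache:

use strict;
use warnings;
use DBI;

# hypothetical storage: an SQLite file with an "interviews" table whose
# "id" column holds the interview-name part of the URL
my $dbh = DBI->connect('dbi:SQLite:dbname=interviews.db', '', '',
                       { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT id FROM interviews');
$sth->execute;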

You also need to filter out the links that are not interviews. One of those would be http://millercenter.org/president/clinton/oralhistory/clinton-description, which is the 'read more' target of the first paragraph on the page.
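
One simple way to apply that filter, using only the exclusion named above (extend the list as more non-interview slugs turn up):

# drop known non-interview pages that live under the same path
my %not_interview = map { $_ => 1 } qw(clinton-description);
@urls = grep { m{/oralhistory/([^/]+)$} && !$not_interview{$1} } @urls;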

Upvotes: 0
