user2154731

Reputation: 203

Web-crawler optimization

I am building a basic search engine using the vector-space model, and this is the crawler, which fetches 500 URLs and removes the SGML tags from their content. However, it is very slow (it takes more than 30 minutes just to retrieve the URLs). How can I optimize the code? I have used wikipedia.org as an example starting URL.

use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = $starting_url;
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;

while (@urls) 
{ 
     my $url = shift @urls;
     next if $alreadyvisited{$url}; ## check if already visited

     my $request = HTTP::Request->new(GET => $url);
     my $response = $browser->request($request);

     if ($response->is_error())
     {
         print $response->status_line, "\n"; ## check for bad URL
     }
     my $contents = $response->content(); ## get contents from URL
     push @c, $contents;
     my @text = &RemoveSGMLtags(\@c);
     #print "@text\n";

     $alreadyvisited{$url} = 1; ## store URL in hash for future reference
     $url_count++;
     print "$url\n";

     if ($url_count == 500) ## exit if the number of crawled pages reaches the limit
     {
         exit 0; 
     } 


     my ($page_parser) = HTML::LinkExtor->new(undef, $url); 
     $page_parser->parse($contents)->eof; ## parse page contents
     my @links = $page_parser->links; 

     foreach my $link (@links) 
     {
             $test = $$link[2];
             $test =~ s!^https?://(?:www\.)?!!i;
             $test =~ s!/.*!!;
             $test =~ s/[\?\#\:].*//;
             if ($test eq "en.wikipedia.org")  ## check if URL belongs to the wikipedia domain
             {
                 next if ($$link[2] =~ m/^mailto/); 
                 next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
                 push @urls, $$link[2];
             }
     }
     sleep 1;
}


sub RemoveSGMLtags 
{
    my ($input) = @_;
    my @INPUTFILEcontent = @$input;
    my $j;my @raw_text;
    for ($j=0; $j<=$#INPUTFILEcontent; $j++)
    {
        my $INPUTFILEvalue = $INPUTFILEcontent[$j];
        use HTML::Parse;
        use HTML::FormatText;
        my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
        push @raw_text, ($plain_text);
    }
    return @raw_text;
}

Upvotes: 2

Views: 433

Answers (2)

Andy Lester

Reputation: 93745

Use WWW::Mechanize, which handles all the URL parsing and extraction for you. It's so much easier than all the link parsing you're dealing with. It was created specifically for the sort of thing you're doing, and because it's a subclass of LWP::UserAgent you should be able to change every LWP::UserAgent to WWW::Mechanize without touching the rest of your code. The one part that does change is the link extraction, which becomes as simple as this:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get( 'someurl.com' );
my @links = $mech->links;

and then @links is an array of WWW::Mechanize::Link objects.
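
For example, here is a minimal sketch of what the crawl loop could look like on top of those link objects. The timeout, the 500-page limit and the en.wikipedia.org host check are carried over from the question; filtering via URI methods on the object returned by url_abs is just one way to restrict the crawl, not something WWW::Mechanize requires:

use strict;
use warnings;
use WWW::Mechanize;

my $mech  = WWW::Mechanize->new( timeout => 5, autocheck => 0 );
my @queue = ('http://en.wikipedia.org/wiki/Main_Page');
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;           # skip URLs we have already fetched

    $mech->get($url);                # fetch the page
    next unless $mech->success;      # skip pages that failed to load

    # ... index $mech->content here ...

    for my $link ( $mech->links ) {  # WWW::Mechanize::Link objects
        my $abs = $link->url_abs;    # absolute URI object for this link
        next unless $abs->scheme =~ /^https?$/;        # drops mailto: and friends
        next unless $abs->host eq 'en.wikipedia.org';  # stay on one domain
        push @queue, $abs->as_string;
    }

    last if keys %seen >= 500;       # stop after 500 pages
}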

Upvotes: 3

Borodin

Reputation: 126742

  • Always use strict

  • Never use the ampersand & on subroutine calls

  • Use URI to manipulate URLs (see the sketch after this list)
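
For the URI point, here is a hedged sketch of how the three substitutions in your link loop could be written instead. HTML::LinkExtor returns absolute URIs when you give it a base URL, so the link element can be handed straight to URI; the en.wikipedia.org host check comes from your code, and dropping fragments is just one possible choice:

use strict;
use warnings;
use URI;

foreach my $link (@links) {
    my $uri = URI->new( $link->[2] );                 # the third element is the link target
    next unless $uri->scheme;                         # skip relative or empty links
    next unless $uri->scheme =~ /^https?$/;           # drops mailto: and friends
    next unless $uri->host eq 'en.wikipedia.org';     # same-domain check
    $uri->fragment(undef);                            # strip any #fragment
    push @urls, $uri->as_string;
}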

You have a sleep 1 in there, which I assume is to avoid hammering the site too much, which is good. But the bottleneck in almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing your sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a way you should be going?
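
To give an idea of the parallel approach, here is a minimal sketch built on the register/wait interface that LWP::Parallel::UserAgent documents; LWP::Parallel::RobotUA is its robots.txt-aware variant. The batch size and the process_page handler are placeholders for this example, not part of the module:

use strict;
use warnings;
use HTTP::Request;
use LWP::Parallel::UserAgent;

my @batch = splice @urls, 0, 10;        # take a handful of queued URLs

my $pua = LWP::Parallel::UserAgent->new();
$pua->timeout(5);                       # seconds per connection
$pua->redirect(1);                      # follow redirects

foreach my $url (@batch) {
    # register() queues the request; it returns a response object
    # immediately only if the registration itself failed
    if ( my $error = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
        print STDERR $error->error_as_HTML;
    }
}

my $entries = $pua->wait();             # perform all registered requests in parallel

foreach my $key ( keys %$entries ) {
    my $response = $entries->{$key}->response;
    next unless $response->is_success;
    process_page( $response->request->url, $response->content );  # placeholder handler
}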

Upvotes: 5
