IMUXIxD

Reputation: 1227

Simple PHP web crawler works on only certain types of pages

I made this simple PHP web crawler that gets the source of a page from the opening body tag onward, strips the HTML tags, and then echoes the content.

It works when I start it on a page ending in .html, but when I give it a URL such as a Google results page, it doesn't follow the links on that page, fetch their content, and echo it.

How can I get it to take the URL of a Google search results page, follow the links within it, and echo their content?

Here is the code of the crawler:

error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

$domains = array();

$urls = array();

$dom = new DOMDocument();

$matches = array();

function crawl( $domObject, $url, $matchList )
{
    global $domains, $urls;
    $parse = parse_url( $url );
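    // Count this visit against the host, starting the counter at 1 on first use.
    $host = isset( $parse['host'] ) ? $parse['host'] : '';
    $domains[ $host ] = isset( $domains[ $host ] ) ? $domains[ $host ] + 1 : 1;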
    $urls[] = $url;

    $content = file_get_contents( $url );
    if ( $content === FALSE ) {
        return;
    }

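    // Keep everything from the opening body tag onward. Matching "<body" also catches tags with attributes.
    $body = stristr( $content, "<body" );
    if ( $body !== FALSE ) {
        $content = $body;
    }
    $domObject->loadHTML( $content );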
    $anchors = $domObject->getElementsByTagName('a');
    foreach ( $anchors as $anchor ) {
        $href = (string) $anchor->getAttribute( 'href' );
        if ( preg_match( '/(?:https?:\/\/|www)[^\'\" ]*/i', $href ) ) {
            // Absolute link: queue it as-is.
            array_push( $matchList, $href );
        }
        else {
            // Relative link: prefix it with the current URL up to its last slash.
            preg_match( '/(?:https?:\/\/|www)[^\/]+(?:\S*?\/)*/i', $url, $beginings );
            $urlPrefix = $beginings[0];
            array_push( $matchList, $urlPrefix . $href );
        }
    }
    echo strip_tags( $content ) . "<br /><br /><br />";

    foreach ( $matchList as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        $host = isset( $parse['host'] ) ? $parse['host'] : '';
        // $domains holds integer counters, so compare the value directly rather than calling count() on it.
        $seen = isset( $domains[ $host ] ) ? $domains[ $host ] : 0;
        if ( $seen < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $domObject, $crawled_url, $matchList );
        }
    }
}

crawl($dom, 'http://www.google.com/search?q=google', $matches);

Upvotes: 0

Views: 3035

Answers (1)

Kohjah Breese

Reputation: 4136

I'm not sure what you're using to download URLs.

I'd recommend using this:

http://semlabs.co.uk/journal/object-oriented-curl-class-with-multi-threading

I'm fairly sure Google uses 301 or 302 redirects from links in the search results. So you need your crawler to follow redirects. I assume this is the problem.

Using that class, you need to set the option CURLOPT_FOLLOWLOCATION.

See: http://php.net/manual/en/function.curl-setopt.php
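
As a rough sketch of the redirect-following setup, here is a version using PHP's plain cURL functions rather than the class linked above, whose interface isn't reproduced here. The helper name `fetch_page`, the user agent string, and the timeout and redirect limits are all assumptions chosen for illustration.

```php
<?php
// Sketch: download a page with PHP's cURL extension, following 301/302 redirects.
// fetch_page() is a hypothetical helper name; the limits below are illustrative.
function fetch_page( $url, $maxRedirects = 5 )
{
    $ch = curl_init( $url );
    curl_setopt_array( $ch, array(
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,          // follow Location: headers
        CURLOPT_MAXREDIRS      => $maxRedirects, // give up on redirect loops
        CURLOPT_CONNECTTIMEOUT => 5,             // seconds to wait for a connection
        CURLOPT_TIMEOUT        => 15,            // seconds allowed for the whole transfer
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1', // assumed value
    ) );

    $body = curl_exec( $ch );
    if ( $body === false ) {
        return null; // network error, timeout, or too many redirects
    }

    // The handle is released when $ch goes out of scope.
    return array(
        'url'    => curl_getinfo( $ch, CURLINFO_EFFECTIVE_URL ), // final URL after redirects
        'status' => curl_getinfo( $ch, CURLINFO_HTTP_CODE ),
        'body'   => $body,
    );
}
```

In the crawler from the question, the `file_get_contents( $url )` call could be swapped for `fetch_page( $url )`, with the returned `'url'` used as the base when resolving relative links.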

Further, if you are planning on scraping Google, you'll need a lot of sleeps and/or some good proxies. Google blocks automated queries. One way around this, to some extent, is to pay $100 for Google XML results via Google Custom Search.
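
To make the "lot of sleeps" concrete, here is one possible way to space out requests per host. The helper name `polite_delay`, the 5-second base gap, and the random jitter are assumptions for illustration, not values known to satisfy Google or any other site.

```php
<?php
// Sketch: compute how long to wait before the next request to a host.
// polite_delay() is a hypothetical helper; tune the numbers for the site being crawled.
function polite_delay( array &$lastHit, $host, $now, $minGap = 5.0, $jitter = 2.0 )
{
    // Minimum gap plus a random extra wait in [0, $jitter] seconds.
    $gap = $minGap + $jitter * ( mt_rand() / mt_getrandmax() );

    $wait = 0.0;
    if ( isset( $lastHit[ $host ] ) ) {
        $wait = max( 0.0, $lastHit[ $host ] + $gap - $now );
    }

    // Record when the upcoming request is expected to go out.
    $lastHit[ $host ] = $now + $wait;
    return $wait;
}

// Usage: sleep for the returned number of seconds before each download.
$lastHit = array();
$host = parse_url( 'http://www.google.com/search?q=google', PHP_URL_HOST );
$wait = polite_delay( $lastHit, $host, microtime( true ) );
usleep( (int) round( $wait * 1000000 ) );
```

Passing the current time in as `$now` keeps the calculation separate from the actual sleep, so the spacing logic can be checked without waiting.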

Upvotes: 3
