Reputation: 1227
I've started learning about web crawlers, and with the help of an article I built the simple one below.
The article suggested using multithreading to make the crawler faster.
I was wondering if someone could help me learn more about multithreading and maybe even apply it to the crawler below.
Also, if you have any other suggestions for improving this crawler, please feel free to share.
Here is the code:
<?php
error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

$domains = array();   // pages crawled so far, per host
$urls    = array();   // every URL already visited

function crawl( $url )
{
    global $domains, $urls;

    // Record this URL and bump the per-domain counter.
    $parse = parse_url( $url );
    $domains[ $parse['host'] ] = isset( $domains[ $parse['host'] ] ) ? $domains[ $parse['host'] ] + 1 : 1;
    $urls[] = $url;

    $content = file_get_contents( $url );
    if ( $content === FALSE ) {
        return;
    }
    // do something with content.

    // Only look for links inside the body of the page.
    $content = stristr( $content, "body" );
    preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

    foreach ( $matches[0] as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        // Skip domains that have reached the crawl limit and URLs already seen.
        $crawled_so_far = isset( $domains[ $parse['host'] ] ) ? $domains[ $parse['host'] ] : 0;
        if ( $crawled_so_far < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $crawled_url );
        }
    }
}
Thank you in advance. I'd appreciate any and all help.
Upvotes: 0
Views: 1167
Reputation: 4248
Fortunately or not, PHP does not support multithreading. What you can do is implement an asynchronous pattern, but that means giving up nice one-line functions like file_get_contents
and switching to low-level page reading (with fsockopen),
doing all writes and reads manually in non-blocking mode so that other connections can do their work while a particular one is waiting. See the example code here.
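As a rough sketch of that non-blocking approach, the function below fetches several pages concurrently with fsockopen and stream_select. It assumes plain HTTP on port 80 and skips error handling, redirects, and chunked-transfer decoding; the name fetch_all and its details are made up for illustration, not taken from the linked example.

<?php
// Minimal sketch of non-blocking HTTP fetching with fsockopen and
// stream_select. Assumes plain HTTP on port 80; no error handling,
// redirects, or chunked-transfer decoding. fetch_all() is a made-up
// name for this illustration.
function fetch_all( array $urls )
{
    $sockets = array();
    $results = array();

    foreach ( $urls as $url ) {
        $host = parse_url( $url, PHP_URL_HOST );
        $path = parse_url( $url, PHP_URL_PATH );
        if ( $path === null || $path === '' ) {
            $path = '/';
        }

        // fsockopen still blocks while connecting; good enough for a sketch.
        $socket = fsockopen( $host, 80, $errno, $errstr, 5 );
        if ( $socket === FALSE ) {
            continue;
        }
        stream_set_blocking( $socket, false );

        // The request is small, so a single fwrite is assumed to send it all.
        fwrite( $socket, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n" );

        $sockets[ $url ] = $socket;
        $results[ $url ] = '';
    }

    // Poll until every socket has delivered its full response.
    while ( !empty( $sockets ) ) {
        $read   = array_values( $sockets );
        $write  = null;
        $except = null;

        // Block for at most one second until some socket has data ready.
        if ( stream_select( $read, $write, $except, 1 ) === FALSE ) {
            break;
        }

        foreach ( $read as $socket ) {
            $url   = array_search( $socket, $sockets, true );
            $chunk = fread( $socket, 8192 );
            if ( $chunk !== FALSE && $chunk !== '' ) {
                $results[ $url ] .= $chunk;
            }
            if ( feof( $socket ) ) {   // this response is complete
                fclose( $socket );
                unset( $sockets[ $url ] );
            }
        }
    }

    // Raw responses (status line + headers + body), keyed by URL.
    return $results;
}

You would call it with a batch of URLs, for example $pages = fetch_all( array( 'http://example.com/', 'http://example.org/' ) );, and then strip the headers and scan each body for links the way crawl() does above.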
Upvotes: 2