Reputation: 1227
I've started learning about web crawlers, and with the help of an article I built the simple one below.
The article suggested using multithreading to make the crawler faster.
I was wondering if someone could help me learn more about multithreading and maybe even apply it to the crawler below.
Also, if you have any other suggestions for improving this crawler, please feel free to share.
Here is the code:
<?php
error_reporting( E_ERROR );

define( "CRAWL_LIMIT_PER_DOMAIN", 50 );

$domains = array();   // pages crawled so far, per host
$urls    = array();   // every URL already visited

function crawl( $url )
{
    global $domains, $urls;

    // Record this URL and bump the per-domain counter.
    $parse = parse_url( $url );
    $domains[ $parse['host'] ] = isset( $domains[ $parse['host'] ] ) ? $domains[ $parse['host'] ] + 1 : 1;
    $urls[] = $url;

    $content = file_get_contents( $url );
    if ( $content === FALSE ) {
        return;
    }
    // do something with content.

    // Only look for links inside the body of the page.
    $content = stristr( $content, "body" );
    preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );

    foreach ( $matches[0] as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        // Skip domains that have reached the crawl limit and URLs already seen.
        $crawled_so_far = isset( $domains[ $parse['host'] ] ) ? $domains[ $parse['host'] ] : 0;
        if ( $crawled_so_far < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $crawled_url );
        }
    }
}
Thank you in advance. I'd appreciate any and all help.
Upvotes: 0
Views: 1167
Reputation: 4248
Fortunately or not, PHP does not support multithreading. What you can do is implement an asynchronous pattern, but that means giving up nice one-line functions like file_get_contents
and switching to low-level page reading (with fsockopen),
doing all writes and reads manually in non-blocking mode so that other connections can do their work while a particular one is waiting. See the example code here.
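As a rough sketch of that non-blocking approach, the function below fetches several pages concurrently with fsockopen and stream_select. It assumes plain HTTP on port 80 and skips error handling, redirects, and chunked-transfer decoding; the name fetch_all and its details are made up for illustration, not taken from the linked example.

<?php
// Minimal sketch of non-blocking HTTP fetching with fsockopen and
// stream_select. Assumes plain HTTP on port 80; no error handling,
// redirects, or chunked-transfer decoding. fetch_all() is a made-up
// name for this illustration.
function fetch_all( array $urls )
{
    $sockets = array();
    $results = array();

    foreach ( $urls as $url ) {
        $host = parse_url( $url, PHP_URL_HOST );
        $path = parse_url( $url, PHP_URL_PATH );
        if ( $path === null || $path === '' ) {
            $path = '/';
        }

        // fsockopen still blocks while connecting; good enough for a sketch.
        $socket = fsockopen( $host, 80, $errno, $errstr, 5 );
        if ( $socket === FALSE ) {
            continue;
        }
        stream_set_blocking( $socket, false );

        // The request is small, so a single fwrite is assumed to send it all.
        fwrite( $socket, "GET $path HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n" );

        $sockets[ $url ] = $socket;
        $results[ $url ] = '';
    }

    // Poll until every socket has delivered its full response.
    while ( !empty( $sockets ) ) {
        $read   = array_values( $sockets );
        $write  = null;
        $except = null;

        // Block for at most one second until some socket has data ready.
        if ( stream_select( $read, $write, $except, 1 ) === FALSE ) {
            break;
        }

        foreach ( $read as $socket ) {
            $url   = array_search( $socket, $sockets, true );
            $chunk = fread( $socket, 8192 );
            if ( $chunk !== FALSE && $chunk !== '' ) {
                $results[ $url ] .= $chunk;
            }
            if ( feof( $socket ) ) {   // this response is complete
                fclose( $socket );
                unset( $sockets[ $url ] );
            }
        }
    }

    // Raw responses (status line + headers + body), keyed by URL.
    return $results;
}

You would call it with a batch of URLs, for example $pages = fetch_all( array( 'http://example.com/', 'http://example.org/' ) );, and then strip the headers and scan each body for links the way crawl() does above.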
Upvotes: 2