Reputation: 1227
I'm trying to run a web crawler pointed at a single URL that contains no links. The code seems fine, but I am getting an HTTP 500 error.
All it does with the content it crawls is echo it.
Any idea why?
<?php
error_reporting( E_ERROR );
define( "CRAWL_LIMIT_PER_DOMAIN", 50 );
$domains = array();
$urls = array();
function crawl( $url )
{
    global $domains, $urls;
    $parse = parse_url( $url );
    $domains[ $parse['host'] ]++;
    $urls[] = $url;
    $content = file_get_contents( $url );
    if ( $content === FALSE ){
        echo "Error: No content";
        return;
    }
    $content = stristr( $content, "body" );
    preg_match_all( '/http:\/\/[^ "\']+/', $content, $matches );
    // do something with content.
    echo $content;
    foreach( $matches[0] as $crawled_url ) {
        $parse = parse_url( $crawled_url );
        if ( count( $domains[ $parse['host'] ] ) < CRAWL_LIMIT_PER_DOMAIN && !in_array( $crawled_url, $urls ) ) {
            sleep( 1 );
            crawl( $crawled_url );
        }
    }
}
crawl(http://the-irf.com/hello/hello6.html);
?>
Upvotes: 2
Views: 913
Reputation: 11393
Replace:
crawl(http://the-irf.com/hello/hello6.html);
with:
crawl('http://the-irf.com/hello/hello6.html');
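The URL is a text string, so it must be enclosed in quotes. Without them the script fails with a parse error before any of it runs, which is most likely what the server is reporting as the HTTP 500.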
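Part of why the unquoted form breaks is that // starts a comment in PHP, so the parser only sees crawl(http: and then gives up. Here is a minimal sketch of the quoted call; crawl_host() is a hypothetical stand-in for your crawl() that only extracts the host, so it can run without touching the network:

```php
<?php
// Hypothetical stand-in for the question's crawl(): returns the host part only.
function crawl_host(string $url): ?string
{
    // With PHP_URL_HOST, parse_url() returns just the host, or null if there is none.
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

// Quoted, the URL is an ordinary string argument.
// Unquoted, everything after "http:" would be read as a // comment.
$host = crawl_host('http://the-irf.com/hello/hello6.html');
echo $host, PHP_EOL; // the-irf.com
```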
About your problem with stristr:
Returns all of haystack starting from and including the first occurrence of needle to the end.
So, your code:
$content = stristr( $content, "body" );
will return all of $content starting from and including the first occurrence of body.
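To make that concrete, here is a small sketch run against a made-up HTML string (the $html value is only an illustration, not your page). Note that the match starts at the word body itself, so the leading < of the tag is dropped:

```php
<?php
// Hypothetical sample document, used only for illustration.
$html = '<html><head><title>Hi</title></head><body><p>Hello</p></body></html>';

// stristr() is case-insensitive and returns the haystack from the first match to the end.
$fromBody = stristr($html, 'body');
echo $fromBody, PHP_EOL; // body><p>Hello</p></body></html>

// Passing true as the third argument returns the part before the match instead.
$beforeBody = stristr($html, 'body', true);
echo $beforeBody, PHP_EOL; // <html><head><title>Hi</title></head><

// stristr() returns false when the needle is not found, so check before using the result.
$missing = stristr($html, 'footer');
var_dump($missing); // bool(false)
```

If you want to keep the opening bracket, searching for '<body' instead of 'body' would start the result at the tag itself.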
Upvotes: 3