Xand94

Reputation: 717

PHP - Parse_url only get pages

I'm working on a little web crawler as a side project at the moment. It collects all the hrefs on a page and then parses those in turn. My problem is this:

How can I get only the actual page results? At the moment I'm using the following:

foreach ($page->getElementsByTagName('a') as $link)
{
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif (@$base_url['host'] == @$compare_url['host'])
    {
        $links[] = $link->getAttribute('href');
    }
}

As you can see, this will bring in JPEGs, EXE files, etc. I only need to pick up web pages like .php, .html, .asp and so on.

I'm not sure if there is a function that can work this out, or whether it will need to be a regex against some sort of master list?

Thanks

Upvotes: 2

Views: 419

Answers (3)

complex857

Reputation: 20753

Since the URL string alone isn't connected to the resource behind it in any way, you will have to go out and ask the web server about it. For this there's an HTTP method called HEAD, so you won't have to download everything.

You can implement this with cURL in PHP like this:

function curl_head($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_NOBODY, true);   // send a HEAD request instead of GET
    curl_setopt($curl, CURLOPT_HEADER, true);   // include the headers in the output
    curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
    $content = curl_exec($curl);
    curl_close($curl);

    // redirected heads just pile up one after another
    $parts = explode("\r\n\r\n", trim($content));

    // return only the last one
    return end($parts);
}

function is_html($url) {
    $header = curl_head($url);
    // look for the content-type part of the header response
    return (bool) preg_match('/content-type\s*:\s*text\/html/i', $header);
}

var_dump(is_html('http://github.com'));

This version only accepts text/html responses and doesn't check whether the response is a 404 or some other error (it does, however, follow redirects up to 5 jumps). You can tweak the regexp or add some error handling, either from the curl response or by matching against the header string's first line.
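As an example of that last point, the status line of the header string can be checked before you look at the content type. This is only a minimal sketch: the helper name is my own, and it assumes the header string produced by `curl_head()` above.

```php
// Hypothetical helper: accept only 2xx responses by inspecting the
// status line, which looks like "HTTP/1.1 200 OK".
function is_success_status($header) {
    if (preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})#', $header, $m)) {
        $code = (int) $m[1];
        return $code >= 200 && $code < 300;
    }
    return false; // no recognizable status line
}
```

You would then require both `is_success_status($header)` and the content-type match to be true before queueing the URL.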

Note: web servers will run scripts behind these URLs to produce responses. Be careful not to overload hosts with probing, or to follow "delete" or "unsubscribe" type links.

Upvotes: 1

Ofir Baruch

Reputation: 10356

Consider using preg_match to check the type of the link (application, picture, HTML file) and then deciding what to do based on the result.
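A minimal sketch of that preg_match idea, assuming the goal from the question (keep .php/.html/.asp-style pages, treat extension-less paths as pages); the helper name and the extension list are illustrative, not part of the original code:

```php
// Classify a URL as a "page" by its path extension.
function is_page_url($url) {
    $path = parse_url($url, PHP_URL_PATH);
    // No path, or no extension (e.g. "http://host/" or "/about/"): treat as a page.
    if (!is_string($path) || !preg_match('/\.([a-z0-9]+)$/i', $path, $m)) {
        return true;
    }
    // Only keep extensions that normally serve HTML pages.
    return (bool) preg_match('/^(php|html?|aspx?)$/i', $m[1]);
}
```

Using parse_url with PHP_URL_PATH first means a query string like `?id=1` won't confuse the extension match.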

Another (simpler) option is to use explode and take the last part of the URL, which comes after a dot (the extension). For instance:

// If the URL has any one of the following extensions, ignore it.
$forbid_ext = array('jpg','gif','exe');

foreach ($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "")
    {
        if (check_link_type($link->getAttribute('href')))
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
    }
    elseif (@$base_url['host'] == @$compare_url['host'])
    {
        if (check_link_type($link->getAttribute('href')))
            $links[] = $link->getAttribute('href');
    }
}

function check_link_type($url)
{
    global $forbid_ext;

    $parts = explode(".", $url);  // end() needs a variable, not an expression
    $ext = end($parts);
    if (in_array($ext, $forbid_ext))
        return false;
    return true;
}

UPDATE: instead of checking 'forbidden' extensions, let's look for good ones:

$good_ext = array('html','php','asp');
function check_link_type($url)
{
   global $good_ext;

   $parts = explode(".", $url);
   $ext = end($parts);
   // Accept extension-less URLs and whitelisted extensions only.
   if ($ext == "" || in_array($ext, $good_ext))
     return true;
   return false;
}

Upvotes: 0

goldstein

Reputation: 361

To check whether a page is valid (has an .html, .php, ... extension), use this function:

function check($url) {
    $extensions = array("php", "html"); // Add extensions here
    foreach ($extensions as $ext) {
        if (substr($url, -(strlen($ext) + 1)) == ".".$ext) {
            return 1;
        }
    }
    return 0;
}

foreach ($page->getElementsByTagName('a') as $link) {
    $compare_url = parse_url($link->getAttribute('href'));
    if (@$compare_url['host'] == "") {
        if (check($link->getAttribute('href'))) {
            $links[] = 'http://'.@$base_url['host'].'/'.$link->getAttribute('href');
        }
    }
    elseif (@$base_url['host'] == @$compare_url['host']) {
        if (check($link->getAttribute('href'))) {
            $links[] = $link->getAttribute('href');
        }
    }
}

Upvotes: 0
