max

Reputation: 3716

Working with links: identifying external links and resolving the full address of a link

I'm trying to create a sitemap for my website. I scan the homepage for links, extract them, and then do the same thing recursively for each extracted link.

    function get_contents($url = '') {
        if ($url == '') { $url = $this->base_url; }
        $curl = new cURL;
        $content = $curl->get($url);
        $this->get_links($content);
    }

    public function get_links($contents) {
        $DOM = new DOMDocument();
        $DOM->loadHTML($contents);
        $a = $DOM->getElementsByTagName('a');
        foreach ($a as $link) {
            $h = $link->getAttribute('href');
            $l = $this->base.'/'.$h;
            $this->links[] = $l;
            $this->get_contents($l);
        }
    }

It works fine, but there are a couple of problems.

1 -

I get some links like

www.mysite.com/http://www.external.com

I can do something like

    if (stripos($link, 'http') !== false
        || stripos($link, 'www.') !== false
        || stripos($link, 'https') !== false
    ) {
        if (stripos($link, 'mysite.com') !== false) {
            // ignore this link (yeah, I suck at regex and string matching)
        }
    }

But it seems very complicated and slow. Is there a standard, clean way to find out whether a link is an external link?

2 -

Is there any way to deal with relative paths? I get something like

www.mysite.com/../Domain/List3.html

Obviously this isn't right. I can remove the (../) from the link, but that might not work for all links. Is there any way to find out the full address of a link?

Upvotes: 0

Answers (1)

Alex2php

Reputation: 11250

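For relative paths, you could take a look at realpath(). Note that realpath() resolves paths on the local filesystem, so for URLs you would still have to collapse the ../ segments of the path yourself.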
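As a rough illustration of collapsing the ../ segments by hand, here is a minimal sketch. The function name resolve_url() is hypothetical (it is not a PHP built-in), and it assumes $base is an absolute http(s) URL with a scheme and host. It is not a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper (not a PHP built-in): resolve $href against $base.
// Assumes $base is an absolute URL such as http://www.mysite.com/dir/page.html
function resolve_url(string $base, string $href): string
{
    // A link that already carries a scheme is absolute: return it unchanged.
    if (is_string(parse_url($href, PHP_URL_SCHEME))) {
        return $href;
    }
    $b = parse_url($base);
    // Protocol-relative link (//host/path): reuse the base scheme.
    if (strncmp($href, '//', 2) === 0) {
        return $b['scheme'] . ':' . $href;
    }
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Split off ?query and #fragment so they are not treated as path segments.
    $cut = strcspn($href, '?#');
    $suffix = (string) substr($href, $cut);
    $href = (string) substr($href, 0, $cut);

    if ($href === '') {
        $path = $basePath;                       // only a query or fragment
    } elseif ($href[0] === '/') {
        $path = $href;                           // root-relative link
    } else {
        // Relative link: append to the directory part of the base path.
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $href;
    }

    // Collapse "." and ".." segments; ".." above the root is simply dropped.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '' && $seg !== '.') {
            $out[] = $seg;
        }
    }
    $clean = '/' . implode('/', $out);
    if ($out && substr($path, -1) === '/') {
        $clean .= '/';                           // keep a trailing slash
    }
    return $root . $clean . $suffix;
}
```

With this sketch, resolve_url('http://www.mysite.com/', '../Domain/List3.html') gives http://www.mysite.com/Domain/List3.html, and an href such as http://www.external.com is returned untouched instead of being glued onto the base.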

Use parse_url() to extract the domain, so you can easily check whether it is equal to your own domain. Note that parse_url() only recognizes the host when a scheme is present, so you may need to prepend http:// if the link has no http[s]:// prefix.
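The host comparison could look roughly like the sketch below. The function name is_external_link() and the "www." normalization are assumptions for illustration. It treats an href without a host as relative, and therefore internal, which is how a browser reads it.

```php
<?php
// Hypothetical helper: true when $href points at a host other than $ownHost.
function is_external_link(string $href, string $ownHost): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    // No host detected: the href is relative (e.g. ../Domain/List3.html),
    // so it belongs to the current site. Links such as mailto: also have
    // no host and would need to be filtered out separately.
    if (!is_string($host)) {
        return false;
    }
    // Compare case-insensitively and ignore a leading "www." (an assumption;
    // drop this if www.mysite.com and mysite.com are different sites for you).
    $normalize = static function (string $h): string {
        return preg_replace('/^www\./', '', strtolower($h));
    };
    return $normalize($host) !== $normalize($ownHost);
}
```

For example, is_external_link('http://www.external.com', 'www.mysite.com') is true, while is_external_link('http://mysite.com/page', 'www.mysite.com') is false. If your data contains bare strings like www.external.com/page that you know are meant as hosts, prepend http:// before calling it.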

Upvotes: 2
