Reputation: 3716
I'm trying to create a sitemap for my website. Basically I scan the homepage for links, extract them, and do the same thing recursively for every extracted link.
public function get_contents($url = '') {
    // default to the site's base URL on the first call
    if ($url == '') { $url = $this->base_url; }
    $curl = new cURL;
    $content = $curl->get($url);
    $this->get_links($content);
}

public function get_links($contents) {
    $DOM = new DOMDocument();
    $DOM->loadHTML($contents);
    $a = $DOM->getElementsByTagName('a');
    foreach ($a as $link) {
        $h = $link->getAttribute('href');
        $l = $this->base_url . '/' . $h;
        $this->links[] = $l;
        // recurse into the extracted link
        $this->get_contents($l);
    }
}
It works fine, but there are a couple of problems.
1 -
I get some links like
www.mysite.com/http://www.external.com
I can do something like:
if (stripos($link, 'http') !== false
    || stripos($link, 'www.') !== false
    || stripos($link, 'https') !== false)
{
    if (stripos($link, 'mysite.com') !== false)
    {
        // ignore this link (yeah, I suck at regex and string matching)
    }
}
But this seems very complicated and slow. Is there a standard, clean way to find out whether a link is an external link?
2 -
Is there any way to deal with relative paths? I get something like
www.mysite.com/../Domain/List3.html
Obviously this isn't right. I can remove the ../ from the link, but that might not work for all links. Is there any way to find out the full address of a link?
Upvotes: 0
Views: 78
Reputation: 11250
For relative paths, you could take a look at realpath().
Use parse_url() to get the domain, for example, so you can easily check whether the domain is equal to your own. Note that parse_url() requires a scheme to be defined, so you may need to prepend http:// if there is no http[s].
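A minimal sketch of that check, assuming all you need is a host comparison (the is_external_link() helper and $my_host are just illustrative names, not anything from your code):

// Rough sketch; $my_host and is_external_link() are illustrative names.
$my_host = parse_url('http://www.mysite.com', PHP_URL_HOST);

function is_external_link($link, $my_host) {
    // parse_url() only reports a host when the URL carries a scheme,
    // so prepend http:// to scheme-less links that start with "www."
    if (!preg_match('#^https?://#i', $link) && stripos($link, 'www.') === 0) {
        $link = 'http://' . $link;
    }

    $host = parse_url($link, PHP_URL_HOST);

    // Relative links have no host at all, so treat them as internal.
    if ($host === null || $host === false) {
        return false;
    }

    return strcasecmp($host, $my_host) !== 0;
}

var_dump(is_external_link('http://www.external.com/page', $my_host)); // bool(true)
var_dump(is_external_link('/Domain/List3.html', $my_host));           // bool(false)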
Upvotes: 2