raj

Reputation: 45

Parsing a URL for a crawler

I am writing a small crawler that extracts content from some 5 to 10 sites. While collecting the links, I am getting some URLs like this:

../tets/index.html

If it were /test/index.html, we could append it to the base URL to get http://www.example.com/test/index.html

What can I do with this kind of URL?

Upvotes: 2

Views: 307

Answers (3)

Alix Axel

Reputation: 154563

Take a look at the URL normalization page on Wikipedia.

Upvotes: 0

shamittomar

Reputation: 46692

Use dirname() to get the base directory, strip the leading ".." with substr(), and append the rest to it. Like this:

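<?php
// Relative link found on the page.
$url = "../tets/index.html";
// URL of the current directory, without a trailing slash.
$currentURL = "http://example.com/somedir/anotherdir";
// dirname() drops the last segment; substr($url, 2) strips the leading "..".
// Note: this only handles a single leading ".." segment.
echo dirname($currentURL) . substr($url, 2);
?>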

This outputs:

http://example.com/somedir/tets/index.html

Upvotes: 0

greg0ire

Reputation: 23255

URLs like these are relative URLs. ".." means "parent directory", whereas "." simply means "this directory", as in bash. For instance, if you are looking at the page http://www.someserver/test/foo/bar.html and it contains a link like "../baz/foobar.html", that link will in fact point to http://www.someserver/test/baz/foobar.html, I think. Just test it.
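
To make the "parent directory" rule concrete, here is a minimal sketch of how a crawler might resolve such links. The function name resolve_url() is hypothetical (it is not a PHP built-in), and the sketch assumes the base is a full http(s) URL; it does not handle protocol-relative links ("//host/..."), user info, or every query string and fragment edge case.

```php
<?php
// Hypothetical helper, not part of PHP: resolve $relative against $base.
function resolve_url(string $base, string $relative): string
{
    // A link that already has a scheme (http:, mailto:, ...) is returned unchanged.
    if (is_string(parse_url($relative, PHP_URL_SCHEME))) {
        return $relative;
    }

    // Assumes $base is a full URL with a scheme and a host.
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $basePath = $parts['path'] ?? '/';

    if ($relative !== '' && $relative[0] === '/') {
        // Root-relative link: the base path is ignored entirely.
        $path = $relative;
    } else {
        // Keep the directory of the base path, drop its last segment (the "file").
        $dir = substr($basePath, 0, strrpos($basePath, '/') + 1);
        $path = $dir . $relative;
    }

    // Collapse "." (this directory) and ".." (parent directory) segments.
    $out = [];
    foreach (explode('/', ltrim($path, '/')) as $segment) {
        if ($segment === '..') {
            array_pop($out); // going above the root is silently ignored
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $root . '/' . implode('/', $out);
}

echo resolve_url('http://www.someserver/test/foo/bar.html', '../baz/foobar.html'), "\n";
// prints http://www.someserver/test/baz/foobar.html
echo resolve_url('http://www.example.com/', '/test/index.html'), "\n";
// prints http://www.example.com/test/index.html
```

Note that the last path segment of the base is treated as a file unless the base ends with a slash, so "http://example.com/somedir/anotherdir" and "http://example.com/somedir/anotherdir/" give different results for the same relative link.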

Upvotes: 1
