Reputation: 13527
I can scrape a page for URLs, but what is the easiest way to convert the various forms these links can take into fully qualified URLs? For example:
If I scrape: www.mysite.com/some/place/in/space.html
And I get the following URLs:
../img.jpg
img.jpg
../../bla.jpg
inc/bla.jpg
/
./
They should resolve to:
www.mysite.com/some/place/img.jpg
www.mysite.com/some/place/in/img.jpg
www.mysite.com/some/bla.jpg
www.mysite.com/some/place/in/inc/bla.jpg
www.mysite.com/
www.mysite.com/some/place/in/
Is there a function that does this for all cases or is it something I would have to code?
Upvotes: 0
Views: 107
Reputation: 48131
I use this function, from a crawler I wrote a long time ago: http://codepad.org/1VxMECNj
Call the function with the host prepended:
relativeUrl('http://host/dir/dir2/../../file.html');
//> returns http://host/file.html
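For readers who cannot open the link, here is a rough sketch of what a function like this might do. It is not the code behind the codepad URL; the name `collapseDotSegments` is made up for illustration, and it assumes the input is already an absolute URL with a scheme and host.

```php
<?php
// Hypothetical helper, NOT the code from the codepad link.
// Collapses "." and ".." segments in the path of an absolute URL.
// Simplifications: empty segments ("//") are merged, user:pass@ is dropped.
function collapseDotSegments(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL, leave it untouched
    }

    $path = $parts['path'] ?? '/';
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            array_pop($out);               // step up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $out[] = $segment;             // ordinary segment, keep it
        }
    }

    $clean = '/' . implode('/', $out);
    // A path ending in "/", "/." or "/.." names a directory: keep the slash.
    if (count($out) > 0 && preg_match('#/(\.\.?)?$#', $path)) {
        $clean .= '/';
    }

    $result = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $result .= ':' . $parts['port'];
    }
    $result .= $clean;
    if (isset($parts['query'])) {
        $result .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $result .= '#' . $parts['fragment'];
    }
    return $result;
}

echo collapseDotSegments('http://host/dir/dir2/../../file.html'), PHP_EOL;
```

A `..` that would climb above the root is simply ignored here, which is one reasonable choice among several.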
Upvotes: 1
Reputation: 11623
You could use a regex to prefix the relative links with the site URL:
$data = preg_replace('#(href|src)="([^:"]*)("|(?:(?:%20|\s|\+)[^"]*"))#', '$1="' . $site_url . '$2$3', $data);
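To see what this replacement does in practice, here is a small run on made-up input. The values of `$site_url` and `$data` are assumptions chosen to match the question; the pattern is the one above, unchanged.

```php
<?php
// Sample values are assumptions made up for this demonstration.
$site_url = 'http://www.mysite.com/some/place/in/';
$data = '<a href="http://other.com/x">ext</a> '
      . '<img src="../img.jpg"> '
      . '<a href="inc/bla.jpg">b</a>';

// Links containing ":" (already absolute) are skipped by [^:"]*;
// everything else gets $site_url put in front of it.
$data = preg_replace(
    '#(href|src)="([^:"]*)("|(?:(?:%20|\s|\+)[^"]*"))#',
    '$1="' . $site_url . '$2$3',
    $data
);

echo $data, PHP_EOL;
```

Note that this only prefixes the link. A value like `../img.jpg` keeps its `../` after the prefix, and a root-relative link such as `/` would also be appended to the directory.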
Upvotes: 0
Reputation: 4415
You can just prepend www.mysite.com/some/place/in/ to each URL.
www.mysite.com/some/place/in/../img.jpg should still resolve, I think.
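A minimal sketch of this prepend approach, with one refinement: links that start with `/` are joined to the host rather than to the directory. The function name `prependBase` is hypothetical, and the sketch assumes the base URL includes its scheme (the question's URLs omit it).

```php
<?php
// Hypothetical helper: picks the prefix a scraped link should be joined to.
// Assumes $base is a well-formed absolute URL including its scheme.
// Query-only ("?x=1") and fragment-only ("#top") links are not handled.
function prependBase(string $base, string $link): string
{
    // Already absolute (has a scheme such as http: or mailto:): keep as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts = parse_url($base);
    $root = $parts['scheme'] . '://' . $parts['host']
          . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative link ("//cdn.example/x.js"): reuse only the scheme.
    if (strpos($link, '//') === 0) {
        return $parts['scheme'] . ':' . $link;
    }
    // Root-relative link ("/", "/img.jpg"): join to the host.
    if (strpos($link, '/') === 0) {
        return $root . $link;
    }

    // Anything else is relative to the directory of the base page.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $link;
}

$base = 'http://www.mysite.com/some/place/in/space.html';
foreach (['../img.jpg', 'img.jpg', 'inc/bla.jpg', '/', './'] as $link) {
    echo prependBase($base, $link), PHP_EOL;
}
```

The results still contain any `../` or `./` segments from the link, so they would need a separate collapsing step if a canonical form is required.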
Upvotes: 0