Reputation:

Using Simple HTML DOM to get absolute URLs

What I want to do: Scape all the links from a page using Simple HTML DOM while taking care to get full links (i.e. from http:// all the way to the end of the address).

My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.

More examples: If I scrape the URL: http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll, and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.

I've tried so many different techniques of building the actual full URL, but there are so many possible cases that I am completely at a loss as to how I can possibly cover all the bases.

However, I'm sure there are many people who've had this problem before - which is why I turn to you!

P.S I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/ which is not a full url and yet is not local.

Upvotes: 2

Answers (4)

pguardiario

Reputation: 55002

You need a library that converts relative urls to absolute. URL To Absolute seems popular. Then you just:

require('url_to_absolute.php');

foreach($doc->find('a[href]') as $a){
  echo url_to_absolute('http://en.wikipedia.org/wiki/Leaf', $a->href) . "\n";
}

See PHP: How to resolve a relative url for a list of libraries.

Upvotes: 1