user993683

Reputation:

Using Simple HTML DOM to get *absolute* URLs

What I want to do: Scrape all the links from a page using Simple HTML DOM, making sure I get full links (i.e. from http:// all the way to the end of the address).

My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.

More examples: If I scrape the URL: http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll, and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.

I've tried so many different techniques of building the actual full URL, but there are so many possible cases that I am completely at a loss as to how I can possibly cover all the bases.

However, I'm sure there are many people who've had this problem before - which is why I turn to you!

P.S. I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/, which is not a full URL and yet is not local.

Upvotes: 2

Views: 2753

Answers (4)

pguardiario

Reputation: 55002

You need a library that converts relative URLs to absolute ones. URL To Absolute seems popular. Then you just:

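require('simple_html_dom.php');
require('url_to_absolute.php');

// Base URL of the page being scraped; every href is resolved against it.
$base = 'http://en.wikipedia.org/wiki/Leaf';
$doc = file_get_html($base);

foreach ($doc->find('a[href]') as $a) {
  echo url_to_absolute($base, $a->href) . "\n";
}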

See PHP: How to resolve a relative url for a list of libraries.
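
If you'd rather not pull in a library, here is a rough sketch of what such a resolver does for the cases in the question. The function name resolve_url is made up for this example (it is not part of Simple HTML DOM or url_to_absolute.php), it only targets plain http/https pages, and its handling of "." and ".." segments is simplified compared to RFC 3986, so treat it as an illustration rather than a replacement for a tested library.

```php
<?php
// Hypothetical helper for illustration only; assumes an http(s) base URL.
function resolve_url(string $base, string $rel): string
{
    // Already absolute (starts with a scheme such as "http:"): return unchanged.
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $rel)) {
        return $rel;
    }

    $b = parse_url($base);
    $scheme = $b['scheme'] ?? 'http';
    $root = $scheme . '://' . ($b['host'] ?? '')
          . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';

    // Protocol-relative ("//en.wikipedia.org/"): borrow only the scheme.
    if (strncmp($rel, '//', 2) === 0) {
        return $scheme . ':' . $rel;
    }

    // Empty, query-only or fragment-only reference: keep the base path.
    if ($rel === '' || $rel[0] === '#' || $rel[0] === '?') {
        $keepQuery = ($rel === '' || $rel[0] === '#') && isset($b['query']);
        return $root . $basePath . ($keepQuery ? '?' . $b['query'] : '') . $rel;
    }

    // Split off any ?query or #fragment so it is not touched below.
    $cut = strcspn($rel, '?#');
    $path = substr($rel, 0, $cut);
    $suffix = substr($rel, $cut);

    // Document-relative ("function.strripos.php"): resolve against the
    // directory of the base path, i.e. everything up to its last "/".
    if ($path[0] !== '/') {
        $path = substr($basePath, 0, (int) strrpos($basePath, '/') + 1) . $path;
    }

    // Simplified removal of "." and ".." segments.
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            if (count($out) > 1) {
                array_pop($out); // never pop the leading empty segment
            }
        } elseif ($seg !== '.') {
            $out[] = $seg;
        }
    }

    return $root . implode('/', $out) . $suffix;
}

// The three cases from the question:
echo resolve_url('http://en.wikipedia.org/wiki/Leaf', '/wiki/Cataphyll'), "\n";
// http://en.wikipedia.org/wiki/Cataphyll
echo resolve_url('http://en.wikipedia.org/wiki/Leaf', '//en.wikipedia.org/'), "\n";
// http://en.wikipedia.org/
echo resolve_url('http://php.net/manual/en/function.strpos.php', 'function.strripos.php'), "\n";
// http://php.net/manual/en/function.strripos.php
```

Note that this ignores a <base href> tag in the page; if the page has one, it should be used as the base instead of the page URL.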

Upvotes: 1

Paul Dessert

Reputation: 6389

I think this is what you're looking for. It worked for me on an old project.

http://www.electrictoolbox.com/php-resolve-relative-urls-absolute/

Upvotes: 1

user993683

Reputation:

Okay, thanks everyone for your comments.

I think the solution is to use regex to find the webroot of any particular URL, then simply append the local address to this.

Tricky part: Designing a regular expression that works for all domains, including their subdomains...
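
For what it's worth, the webroot can probably be had without a regex at all, since PHP's built-in parse_url() already splits a URL into scheme, host and port. A minimal sketch (web_root is a name invented for this example):

```php
<?php
// Hypothetical helper: returns scheme://host[:port], or null if the URL
// has no scheme or host (e.g. it is itself a relative link).
function web_root(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null;
    }
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root;
}

echo web_root('http://en.wikipedia.org/wiki/Leaf'), "\n";            // http://en.wikipedia.org
echo web_root('http://php.net/manual/en/function.strpos.php'), "\n"; // http://php.net
```

One caveat: appending to the webroot is only correct for hrefs that start with a single "/", such as /wiki/Cataphyll. A link like function.strripos.php is relative to the current page's directory (/manual/en/), and //en.wikipedia.org/ needs only the scheme prepended, so those cases still have to be handled separately.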

Upvotes: 0

Dumle29

Reputation: 839

I don't know if this is what you are looking for, but this will give you the full URL of the page it is executed from:

window.location.href

Hope it helps.

Upvotes: 0
