Reputation:
What I want to do: Scrape all the links from a page using Simple HTML DOM, while taking care to get full links (i.e. from http:// all the way to the end of the address).
My Problem: I get links like /wiki/Cell_wall instead of http://www.wikipedia.com/wiki/Cell_wall.
More examples: If I scrape the URL http://en.wikipedia.org/wiki/Leaf, I get links like /wiki/Cataphyll and //en.wikipedia.org/. Or if I'm scraping http://php.net/manual/en/function.strpos.php, I get links like function.strripos.php.
I've tried many different techniques for building the actual full URL, but there are so many possible cases that I'm at a loss as to how to cover them all.
However, I'm sure many people have had this problem before - which is why I turn to you!
P.S. I suppose this question could almost be reduced to just handling local hrefs, but as mentioned above, I've come across //en.wikipedia.org/, which is not a full URL and yet is not local.
Upvotes: 2
Views: 2753
Reputation: 55002
You need a library that converts relative URLs to absolute ones. URL To Absolute seems popular. Then you just:
require('url_to_absolute.php');

// $doc is a Simple HTML DOM document, e.g. the result of file_get_html()
foreach ($doc->find('a[href]') as $a) {
    echo url_to_absolute('http://en.wikipedia.org/wiki/Leaf', $a->href) . "\n";
}
See PHP: How to resolve a relative url for a list of libraries.
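One wrinkle worth noting: if the scraped page declares a &lt;base href&gt; tag, relative links should be resolved against that rather than the page URL. A minimal self-contained sketch of the same approach, assuming file_get_html() from Simple HTML DOM is available (variable names are illustrative):

require('simple_html_dom.php');
require('url_to_absolute.php');

$pageUrl = 'http://en.wikipedia.org/wiki/Leaf';
$doc = file_get_html($pageUrl); // fetch and parse the page

// Prefer an explicit <base href> if the page declares one, else use the page URL
$baseTag = $doc->find('base[href]', 0);
$base = $baseTag ? $baseTag->href : $pageUrl;

foreach ($doc->find('a[href]') as $a) {
    echo url_to_absolute($base, $a->href) . "\n";
}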
Upvotes: 1
Reputation: 6389
I think this is what you're looking for. It worked for me on an old project.
http://www.electrictoolbox.com/php-resolve-relative-urls-absolute/
Upvotes: 1
Reputation:
Okay, thanks everyone for your comments.
I think the solution is to use a regex to find the webroot of any particular URL, then simply append the local address to it.
The tricky part: designing a regex that works for all domains, including their subdomains...
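For what it's worth, a regex may not be needed for the webroot itself: PHP's built-in parse_url() can split out the scheme and host, and a few checks cover the href forms from the question. A rough sketch (resolve_href() is a made-up helper name; this ignores ../ segments, ports, and other edge cases):

// Rough sketch: resolve the href forms from the question against a page URL
// using parse_url(). Ignores ../ segments, ports, and other edge cases.
function resolve_href($pageUrl, $href) {
    $parts = parse_url($pageUrl);
    $scheme = $parts['scheme'];                  // e.g. "http"
    $webroot = $scheme . '://' . $parts['host']; // e.g. "http://en.wikipedia.org"

    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;                  // already absolute
    }
    if (substr($href, 0, 2) === '//') {
        return $scheme . ':' . $href;  // protocol-relative, e.g. //en.wikipedia.org/
    }
    if (substr($href, 0, 1) === '/') {
        return $webroot . $href;       // root-relative, e.g. /wiki/Cataphyll
    }
    // plain relative, e.g. function.strripos.php: resolve against the page's directory
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = preg_replace('#/[^/]*$#', '/', $path);
    return $webroot . $dir . $href;
}

echo resolve_href('http://php.net/manual/en/function.strpos.php', 'function.strripos.php');
// prints http://php.net/manual/en/function.strripos.php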
Upvotes: 0
Reputation: 839
I don't know if this is what you are looking for, but this JavaScript expression will give you the full URL of the page it is executed from:
window.location.href
Hope it helps.
Upvotes: 0