Reputation: 206
I am trying to scrape img src's with php, I can get the src fine, but if the src does not include the full path then I can't really reuse it. Is there a way to grab the full path of the image using php (browsers can get it if you use the right click menu).
ie. How do I get a FULL path including the domain in one of the following two examples?
src="../foo/logo.png"
src="/images/logo.png"
Thanks,
Allan
Upvotes: 2
Views: 3869
Reputation: 283043
You don't need a regex... just some patience. I don't really want to write the code for you, but just check if the src starts with http://
, and if not, you have like 3 different cases.
/
then prepend http://domain.com..
you'll have to split the full URL and hack off pieces until the src starts with a /
Or.... be lazy and steal this script
$url = "http://www.goat.com/money/dave.html";
$rel = "../images/cheese.jpg";
$com = InternetCombineURL($url,$rel);
// Returns http://www.goat.com/images/cheese.jpg
function InternetCombineUrl($absolute, $relative) {
$p = parse_url($relative);
if($p["scheme"])return $relative;
extract(parse_url($absolute));
$path = dirname($path);
if($relative{0} == '/') {
$cparts = array_filter(explode("/", $relative));
}
else {
$aparts = array_filter(explode("/", $path));
$rparts = array_filter(explode("/", $relative));
$cparts = array_merge($aparts, $rparts);
foreach($cparts as $i => $part) {
if($part == '.') {
$cparts[$i] = null;
}
if($part == '..') {
$cparts[$i - 1] = null;
$cparts[$i] = null;
}
}
$cparts = array_filter($cparts);
}
$path = implode("/", $cparts);
$url = "";
if($scheme) {
$url = "$scheme://";
}
if($user) {
$url .= "$user";
if($pass) {
$url .= ":$pass";
}
$url .= "@";
}
if($host) {
$url .= "$host/";
}
$url .= $path;
return $url;
}
From http://www.web-max.ca/PHP/misc_24.php
Upvotes: 3
Reputation: 1002
Unless you have the site URL you're starting with (in which case you can prepend it to the value of the src attribute) it seems like all you're left with there is a string.
I'm assuming you don't have access to any additional information of course. If you're parsing HTML, I'd assume you must be able to access an absolute URL to at least the HTML page, but perhaps not.
Upvotes: 2