Fluidbyte

Reputation: 5210

PHP Regex to determine relative or absolute path

I'm using cURL to pull the contents of a remote site. I need to check every href attribute, determine whether its value is a relative or an absolute path, and then pass that value to something like href="http://www.website.com/index.php?url=[ABSOLUTE_PATH]".

Any help would be greatly appreciated.

Upvotes: 1

Views: 3700

Answers (2)

newfurniturey

Reputation: 38446

A combination of a regex and PHP's parse_url() should help:

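// find all links in a page used within href="" or href='' syntax
$links = array();
// PREG_SET_ORDER yields one array per match instead of one array per capture group
preg_match_all('/href=(?:"([^"]+)"|\'([^\']+)\')/i', $page_contents, $links, PREG_SET_ORDER);

// iterate through each match and check if it's "absolute"
$urls = array();
foreach ($links as $match) {
    // group 1 holds a double-quoted value, group 2 a single-quoted one
    $link = isset($match[2]) ? $match[2] : $match[1];
    $path = $link;
    if ((substr($link, 0, 7) == 'http://') || (substr($link, 0, 8) == 'https://')) {
        // the current link is an "absolute" URL - parse it to get just the path
        $parsed = parse_url($link);
        // a URL such as http://example.com has no path component at all
        $path = isset($parsed['path']) ? $parsed['path'] : '/';
    }
    $urls[] = 'http://www.website.com/index.php?url=' . $path;
}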

To determine whether the URL is absolute, I simply check if it begins with http:// or https://. If your URLs use other schemes such as ftp:// or tel:, you might need to handle those as well.
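
If other schemes do matter, one option is to test for any scheme prefix rather than listing them one by one. The following is only a sketch: isAbsoluteUrl() is a hypothetical helper name, and treating protocol-relative links (//host/path) as absolute is an assumption that may not suit every use case.

```php
<?php
// Hypothetical helper (not from the answer above): returns true when $href
// carries its own scheme (http:, ftp:, tel:, mailto:, ...).
function isAbsoluteUrl(string $href): bool
{
    $href = trim($href);

    // Assumption: protocol-relative links such as //cdn.example.com/a.js
    // name their own host, so they are counted as absolute here.
    if (strpos($href, '//') === 0) {
        return true;
    }

    // RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".",
    // terminated by a colon.
    return preg_match('~^[a-z][a-z0-9+.\-]*:~i', $href) === 1;
}
```

The substr() checks in the loop above could then be swapped for a single isAbsoluteUrl($link) call. Keep in mind that parse_url() returns no useful 'path' for something like a mailto: link, so such links probably should be skipped rather than rewritten.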

This solution does use regex to parse HTML, which is often frowned upon. To avoid that, you could switch to DOMDocument, but there's no need for the extra code if the regex isn't causing any issues.
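
For reference, a DOMDocument version of the extraction step might look roughly like this. It is a minimal sketch rather than a drop-in replacement: extractHrefs() is a hypothetical helper name, it needs PHP's DOM extension, and unlike the regex it only looks at <a> elements, not at every tag that carries an href.

```php
<?php
// Hypothetical helper: collect the href value of every <a> element in $html.
function extractHrefs(string $html): array
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; buffer libxml's parse warnings
    // instead of letting them be emitted, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors and other <a> tags without an href
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}
```

The returned array can be fed to the same absolute/relative check as before. Note that loadHTML() rejects an empty string, so it is worth checking that the cURL request actually returned content first.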

Upvotes: 1

ioseb

Reputation: 16951

Here is one possible solution, if I understood the question correctly:

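$prefix = 'http://www.website.com/index.php?url=';
// \b keeps <abbr>, <area> etc. from matching; [^>]* keeps the match inside one tag
$regex = '~(<a\b[^>]*?\bhref\s*=\s*")([^"]*)("[^>]*>)~i';
$html = file_get_contents('http://cnn.com');

$html = preg_replace_callback($regex, function($input) use ($prefix) {
  $parsed = parse_url($input[2]);

  // a lone "path" component means the link is relative: rewrite it
  if (is_array($parsed) && count($parsed) == 1 && isset($parsed['path'])) {
    return $input[1] . $prefix . $parsed['path'] . $input[3];
  }

  // leave every other link untouched; returning nothing would delete the tag
  return $input[0];
}, $html);

echo $html;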

Upvotes: 1
