Reputation: 103

preg_match to find links to images and url

I am trying to use preg_match to find URLs inside <a> and <img> tags so that I can replace them with the updated domain name. For now I am only trying to get the search working for href attributes so that I can print the URLs found. Here is what I have:

$matches = array();
$search="domain.com";
preg_match('|(<a\s*[^>]*href=[\'"]?)|',$prod['value'],$matches);
echo '<p>'.$matches[1].'</p>';

$prod['value'] holds the content that I am trying to search.

Upvotes: 1

Answers (1)

Steven

Reputation: 6148

Your Code

$matches = array();
$search="domain.com";
preg_match('|(<a\s*[^>]*href=[\'"]?)|',$prod['value'],$matches);
echo '<p>'.$matches[1].'</p>';

Firstly, $matches doesn't need to be defined before the preg_match call. preg_match takes it by reference, so PHP creates the variable for you and won't so much as throw a notice.

Secondly, $search is never used, so it doesn't seem to be relevant to the question.

Third, since you haven't shown example input, I'm going to assume that you actually want preg_match_all, so that you get a list of all URLs in the input rather than only the first one.

Fourth, following on from the third point, you need var_dump or print_r instead of echo, because with preg_match_all each $matches[X] is an array.
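To make the third and fourth points concrete, here is a minimal sketch that runs your original pattern through both functions. The $html string is a made-up sample, not your real input.

```php
<?php
// Hypothetical sample input with two links (an assumption, not the asker's data).
$html = '<a href="http://a.example">A</a> <a href="http://b.example">B</a>';

// The pattern from the question, unchanged.
$pattern = '|(<a\s*[^>]*href=[\'"]?)|';

// preg_match stops after the first match, so $first[1] is a string.
preg_match($pattern, $html, $first);
echo $first[1], PHP_EOL;

// preg_match_all collects every match, so $all[1] is an array of strings
// and needs print_r or var_dump rather than echo.
preg_match_all($pattern, $html, $all);
print_r($all[1]);
```

Note that neither variable was declared beforehand; both calls fill in the variable passed as the third argument.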

Regex

Okay, so now for what your regex pattern actually does...

(<a\s*[^>]*href=['"]?)

( - starts a capture group
<a\s* - matches <a followed by 0 or more white space characters
[^>]* - matches 0 or more characters that are not >
href= - matches href=
['"]? - optionally matches either ' or "
) - ends capture group

This means that, run against the example input (see Example Input at the end of this answer), your regex matches only <a href=" from the first link (google) and <a class="fancyStyle" href=" from the second link (youtube). The capture stops at the opening quote, before the URL itself.

/**
Output from:

preg_match_all('|(<a\s*[^>]*href=[\'"]?)|', $string, $matches);
var_dump($matches);

*/
array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(9) "<a href=""
    [1]=>
    string(28) "<a class="fancyStyle" href=""
  }
  [1]=>
  array(2) {
    [0]=>
    string(9) "<a href=""
    [1]=>
    string(28) "<a class="fancyStyle" href=""
  }
}

Working Code

There are a few problems with your code, but the one that stops you from getting the expected URL is that the pattern ends before it reaches the URL, so nothing after the opening quote is captured.

The following regex will match URLs that are within the href attribute of <a> tags.

#<a\s.*?(?:href=['"](.*?)['"]).*?>#is

Explanation

<a - matches the opening of an a tag
\s.*? - matches a white space character followed by any character 0 or more times, as few times as possible (lazy)
(?: - creates a non-capturing group
href= - matches href=
['"] - matches either ' or "
(.*?) - creates a capture group and lazily matches 0 or more characters, stopping before the first...
['"] - matches ' or "
) - ends the non-capturing group
.*?> - matches any character 0 or more times (lazily), followed by the first >
i - makes the regex case insensitive
s - makes . match all characters (including new lines)
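The ? after each .* matters more than it may look. As a rough illustration, the sketch below runs the same pattern twice against a made-up one-line input, once with the lazy (.*?) capture and once with a greedy (.*) capture. The input and variable names are assumptions for the demo only.

```php
<?php
// Hypothetical input: two links on one line.
$html = '<a href="http://one.example">1</a> <a href="http://two.example">2</a>';

// Lazy capture: stops at the first quote after href=.
preg_match('#<a\s.*?(?:href=[\'"](.*?)[\'"]).*?>#is', $html, $lazy);

// Greedy capture: runs on to the last quote that still lets the pattern match.
preg_match('#<a\s.*?(?:href=[\'"](.*)[\'"]).*?>#is', $html, $greedy);

echo $lazy[1], PHP_EOL;   // http://one.example
echo $greedy[1], PHP_EOL; // http://one.example">1</a> <a href="http://two.example
```

With the greedy version the capture swallows everything between the first opening quote and the last closing quote, which is why the lazy form is used here.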

Working Example

preg_match_all('#<a\s.*?(?:href=[\'"](.*?)[\'"]).*?>#is', $string, $matches);
var_dump($matches);

/**
array(2) {
  [0]=>
  array(2) {
    [0]=>
    string(34) "<a href="http://www.google.co.uk">"
    [1]=>
    string(65) "<a class="fancyStyle" href="http://www.youtube.com" id="link136">"
  }
  [1]=>
  array(2) {
    [0]=>
    string(23) "http://www.google.co.uk"
    [1]=>
    string(22) "http://www.youtube.com"
  }
}

*/
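Since the end goal in the question is to swap the domain in both link and image URLs, here is one possible sketch of that step using preg_replace_callback. It is not a definitive implementation: the input string, the new domain name and the extended pattern (which also accepts src on img tags and uses a backreference for the closing quote) are all assumptions made for illustration.

```php
<?php
// Hypothetical input and domain names, for illustration only.
$html = '<a href="http://domain.com/page">x</a> <img src="http://domain.com/pic.png">';
$old  = 'domain.com';
$new  = 'newdomain.example';

// Group 1: tag text up to and including the opening quote.
// Group 2: the quote character itself (reused as \2 for the closing quote).
// Group 3: the URL.
$pattern = '#(<(?:a|img)\s[^>]*?(?:href|src)=([\'"]))(.*?)\2#is';

$result = preg_replace_callback($pattern, function ($m) use ($old, $new) {
    // Rebuild the match, replacing the domain only inside the URL.
    return $m[1] . str_replace($old, $new, $m[3]) . $m[2];
}, $html);

echo $result, PHP_EOL;
```

This only handles quoted attribute values, and it would also pick up attributes whose names merely end in src or href (data-src, for example), so treat it as a starting point and test it against your real content.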

Example Input

All of the code above uses the following as input to preg_match_all:

$string = <<<EOC
    <!doctype html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <title>Title of page</title>
    </head>
    <body>
        <h1>Main Page title</h1>
        <p>
            The following is a <a href="http://www.google.co.uk">link to google</a>.
            This is <a class="fancyStyle" href="http://www.youtube.com" id="link136">another link</a>
        </p>
    </body>
    </html>
EOC;

Upvotes: 3