Reputation: 10913
What is wrong with the regex pattern that I created?
$link_image_pattern = '/\<a\shref="([^"]*)"\>\<img\s.+\><\/a\>/';
preg_match_all($link_image_pattern, $str, $link_images);
What I'm trying to do is match all the links that have images inside them.
But when I try to output $link_images, it contains everything inside the first index:
<pre>
<?php print_r($link_images); ?>
</pre>
The markup looks something like this:
Array ( [0] => Array ([0] => "
<p> </p>
<p><strong><a href="url">Title</a></strong></p>
<p>Desc</p>
<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>
But when outputting the contents of the matches, it simply returns the first string that matches the pattern plus all the other markup in the page like this:
<a href="{$image_url}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url}" width="568" height="347"></a></p>
<p> </p>
<p><strong><a href="url">Title</a></strong></p>
<p>Desc</p>
<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>")
Upvotes: 0
Views: 1342
Reputation: 14990
Regex may not be the best tool for parsing HTML, but there are cases where it is the only option, for example when your text editor's search & replace form has no "insert html parsing script here" option. If you are actually using PHP, you'd be better off with a DOM parser, along these lines:
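$doc = new DOMDocument();
@$doc->loadHTML($sourcestring); # $sourcestring holds the HTML; @ hides warnings about malformed markup
$Document = new DOMXPath($doc);
foreach ($Document->query('//a//img') as $img) {
# $img is an img element nested inside an anchor; do something with it here
}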
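As a slightly fuller sketch of the DOM approach, the snippet below collects the href of every anchor that contains an image. The sample markup, the $html and $links names, and the //a[.//img] query (which selects the anchors rather than the images) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical sample markup; in practice this would be the page source.
$html = '<p><a href="url">Title</a></p>'
      . '<p><a href="big.png"><img src="thumb.png" alt="image"></a></p>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly formed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// //a[.//img] selects each anchor that has an img somewhere inside it.
foreach ($xpath->query('//a[.//img]') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

print_r($links); // only "big.png"; the text-only link is skipped
```

Selecting the anchor instead of the img keeps the href directly available, which is what the question was ultimately after.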
If you do have to use a regex, the expression below generally keeps the you-can't-do-that-in-regex haters away. It ensures that the anchor tag directly contains an img tag, while at the same time avoiding the odd (and very improbable) edge case where an attribute value contains something that looks like an image tag.
<a\b(?=\s|>) # match the open anchor tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])* # match the contents of the tag, skipping over the quoted values
> # match the close of the anchor tag
<img\b(?=\s|>) # match the open img tag
(?:='[^']*'|="[^"]*"|=[^'"][^\s>]*|[^>=])* # match the contents of the img tag, skipping over the quoted value
> # match the close of the img tag
<\/a> # match the close anchor tag
Sample Text
Note that the last line has an ugly attribute which will foil most other regular expressions.
<p> </p>
<p><strong><a href="url">Title</a></strong></p>
<p>Desc</p>
<p><a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a></p>
<p><a href="{$image_url2}" Onmouseover="function(' ><img src=picture.png></a> ');" >I do not have an image</a></p>
Code
<?php
$sourcestring="your source string";
preg_match_all('/<a\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<img\b(?=\s|>)
(?:=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*|[^>=])*
>
<\/a>/imsx',$sourcestring,$matches);
echo "<pre>".htmlspecialchars(print_r($matches,true))."</pre>";
?>
Matches
[0] => <a href="{$image_url2}"><img style="background-image:none;padding-left:0;padding-right:0;display:inline;padding-top:0;border-width:0;" title="image" border="0" alt="image" src="{$image_url2}" width="569" height="409"></a>
Upvotes: 3
Reputation: 17218
Maybe the problem is in the .+\> part, because .+ is greedy and matches everything up to the last > it can find.
Try the same method you already use to stop at the closing ":
[^\>]+
This works in my editor:
<a.+><img[^>]+></a>
For your needs you only have to add some backslashes (\) before <, > and /.
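To make the difference concrete, here is a minimal sketch that runs the pattern from the question and a variant using [^>]+ against a shortened version of the markup. The $html sample and the variable names are made up for illustration. The sample is deliberately a single line, because . does not match newlines unless the s modifier is set.

```php
<?php
// Shortened, hypothetical sample based on the markup in the question.
$html = '<p><a href="a.png"><img src="a.png"></a></p>'
      . '<p><a href="url">Title</a></p>'
      . '<p><a href="b.png"><img src="b.png"></a></p>';

// Greedy: .+ runs to the end of the string, then backtracks only as far
// as the LAST "></a>", so everything in between is swallowed.
$greedy = '/<a\shref="([^"]*)"><img\s.+><\/a>/';
// Negated class: [^>]+ cannot cross the > that closes the img tag.
$bounded = '/<a\shref="([^"]*)"><img\s[^>]+><\/a>/';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($bounded, $html, $boundedMatches);

echo count($greedyMatches[0]), "\n";   // 1: one oversized match
echo count($boundedMatches[0]), "\n";  // 2: one match per image link
print_r($boundedMatches[1]);           // captured hrefs: a.png, b.png
```

Only / needs a backslash here, since it is the pattern delimiter; escaping < and > is harmless but optional.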
Upvotes: -1