Reputation: 7094
I have this regex to match image URLs in HTML code:
$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';
$input = <<<HTML
<a href="https://e...content-available-to-author-only...e.com/example1.jpg">
<a href="https://e...content-available-to-author-only...e.com/ストスト.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.bak">
HTML;
$dom = new DomDocument();
$dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', "UTF-8"));
$anchors = $dom->getElementsByTagName("a");
$regex = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';
foreach ($anchors as $anchor) {
    $res = $anchor->getAttribute("href");
    if (preg_match($regex, $res)) {
        echo "Valid url: $res" . PHP_EOL;
    } else {
        echo "Invalid url: $res" . PHP_EOL;
    }
}
My question is: how can I make it match only if the URL starts with http or //? Currently it also matches example.jpg, which isn't a full URL.
Upvotes: 3
Views: 57
Reputation: 626816
Matching either http or // at the start of the string can be done with ^(?:http|//), which you need to add at the start of the pattern. To make sure the URL ends with one of the extensions you specified, you need to add $ at the end.
Since you obtain the URL string from a tag attribute using $anchor->getAttribute("href"), you do not need to validate the inner text of the URL. I suggest replacing [\w,=/:.-]+ with .* to match any text in between.
So, you may use
$regex = '#^(?:http|//).*\.(?:jpe?g|png|gif)$#iu';
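As a quick check, the pattern above can be run against a handful of href values. This is only a minimal sketch: the example.com URLs and the $hrefs / $valid names are made up for illustration and are not from the question.

```php
<?php
// Pattern from the answer: must start with http or //, and end with an image extension.
$regex = '#^(?:http|//).*\.(?:jpe?g|png|gif)$#iu';

// Hypothetical href values, chosen for illustration only.
$hrefs = [
    'https://example.com/example1.jpg',
    '//example.com/ストスト.jpg',
    'example.jpg',
    'https://example.com/example3.bak',
];

// Collect only the values that the pattern accepts.
$valid = [];
foreach ($hrefs as $href) {
    if (preg_match($regex, $href)) {
        $valid[] = $href;
    }
}

print_r($valid);
```

Only the first two values are kept. Note that http here is just a prefix check, so a value such as httpfoo.png would also pass; tightening the prefix to https?:// is one option if that matters.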
Details
^ - start of string
(?:http|//) - http or //
.* - any 0+ chars other than line break chars, as many as possible
\. - a . char
(?:jpe?g|png|gif) - jpeg, jpg, png or gif strings
$ - end of string.
If you want it to work with the HTML text, you need to use
$regex = '#\bhref=(["\']?)((?:http|//)[^"\']*\.(?:jpe?g|png|gif))\1#iu';
if (preg_match_all($regex, $txt, $matches)) {
    print_r($matches[2]);
}
See the regex demo.
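A self-contained sketch of the snippet above, assuming $txt holds raw HTML. The fragment and the $urls name are hypothetical, picked to show double-quoted, single-quoted, relative, and wrong-extension cases.

```php
<?php
// Pattern from the answer: Group 1 is the optional quote, Group 2 is the URL.
$regex = '#\bhref=(["\']?)((?:http|//)[^"\']*\.(?:jpe?g|png|gif))\1#iu';

// Hypothetical HTML fragment standing in for $txt.
$txt = <<<'HTML'
<a href="https://example.com/a.jpg">double quotes</a>
<a href='//example.com/b.PNG'>single quotes</a>
<a href="example.jpg">relative path</a>
<a href="https://example.com/c.bak">wrong extension</a>
HTML;

// $matches[2] holds the Group 2 capture of every match.
$urls = [];
if (preg_match_all($regex, $txt, $matches)) {
    $urls = $matches[2];
}

print_r($urls);
```

Here $urls ends up with the first two links only: the relative path fails the http or // prefix, and the .bak link fails the extension check.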
Details
\b - word boundary
href= - literal text
(["\']?) - Group 1: an optional " or '
((?:http|//)[^"\']*\.(?:jpe?g|png|gif)) - Group 2:
    (?:http|//) - http or //
    [^"\']* - 0+ chars other than ' and "
    \. - a .
    (?:jpe?g|png|gif) - extension string
\1 - same value as in Group 1, either " or ' or empty.
Upvotes: 1
Reputation: 37367
I'd suggest this pattern: href="((?:http|\/\/)[^"]+\.(?:jpe?g|png|gif))"
Explanation:
href=" - match href=" literally; this ensures that you match a hyperlink
(...) - capturing group to store the actual link
(?:...) - non-capturing group
http|\/\/ - match http or //
[^"]+ - match 1+ characters other than "
\. - match . literally
jpe?g|png|gif - alternation, match one of the options: jpeg, jpg (due to e?), png, gif
" - match " literally
The matched link will be inside the 1st capturing group.
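In PHP this pattern could be applied roughly as follows. The / delimiters, the sample $html string, and the $links name are assumptions added for illustration; the answer itself gives no delimiters or flags.

```php
<?php
// Pattern from the answer, wrapped in / delimiters (an assumption, see above).
$regex = '/href="((?:http|\/\/)[^"]+\.(?:jpe?g|png|gif))"/';

// Hypothetical HTML string with one absolute and one relative link.
$html = '<a href="https://example.com/pic.jpeg">x</a> <a href="thumb.gif">y</a>';

// $matches[1] holds the 1st capturing group of every match.
$links = [];
if (preg_match_all($regex, $html, $matches)) {
    $links = $matches[1];
}

print_r($links);
```

Only the absolute link is captured. As written, the pattern is case-sensitive and expects double-quoted attributes, so adding the i flag may be worth considering if extensions such as .JPG can occur.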
Upvotes: 1