Henrik Petterson

Reputation: 7094

Only match if it starts with characters

I have this regex to match image URLs in HTML code:

$regex = '#[\w,=/:.-]+\.(?:jpe?g|png|gif)#iu';


PHP demo:

$input = <<<HTML
<a href="https://e...content-available-to-author-only...e.com/example1.jpg">
<a href="https://e...content-available-to-author-only...e.com/ストスト.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.jpg">
<a href="https://e...content-available-to-author-only...e.com/example3.bak">
HTML;

$dom = new DomDocument();
$dom->loadHTML(mb_convert_encoding($input, 'HTML-ENTITIES', "UTF-8"));

$anchors = $dom->getElementsByTagName("a");
$regex = '#^[\w,=/:.-]+\.(?:jpe?g|png|gif)$#iu';

foreach ($anchors as $anchor) {
    $res = $anchor->getAttribute("href");
    if (preg_match($regex, $res)) {
        echo "Valid url: $res" . PHP_EOL;
    } else {
        echo "Invalid url: $res" . PHP_EOL;
    }
}

My question is: how can I make it match only if the URL starts with http or //? Currently it also matches example.jpg, which isn't a full URL.

Upvotes: 3

Views: 57

Answers (2)

Wiktor Stribiżew

Reputation: 626816

Matching either http or // at the start of the string can be done with ^(?:http|//), which you need to add at the start of the pattern. To make sure the URL ends with one of the extensions you specified, add $ at the end.

Since you obtain the URL string from a tag attribute using $anchor->getAttribute("href"), you do not need to validate the inner text of the URL. I suggest replacing [\w,=/:.-]+ with .* to match any text in between.

So, you may use

$regex = '#^(?:http|//).*\.(?:jpe?g|png|gif)$#iu';

Details

  • ^ - start of string
  • (?:http|//) - http or //
  • .* - any 0+ chars other than line break chars, as many as possible
  • \. - a . char
  • (?:jpe?g|png|gif) - jpeg, jpg, png or gif strings
  • $ - end of string.
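To see how the anchored pattern behaves, here is a minimal sketch that runs it against a few strings. The sample URLs (and the cdn.example.test host) are made up for illustration and are not from the question.

```php
<?php
// Anchored pattern from this answer: the string must start with "http" or "//"
// and end with one of the image extensions (case-insensitive, UTF-8 aware).
$regex = '#^(?:http|//).*\.(?:jpe?g|png|gif)$#iu';

// Hypothetical sample values, chosen only to exercise each branch.
$samples = [
    'https://cdn.example.test/a.jpg',   // full URL with an image extension
    '//cdn.example.test/b.PNG',         // protocol-relative, upper-case extension
    'example.jpg',                      // no scheme and no leading //
    'https://cdn.example.test/c.bak',   // wrong extension
];

$results = [];
foreach ($samples as $url) {
    // preg_match returns 1 on a match, 0 on no match
    $results[$url] = preg_match($regex, $url) === 1;
}

foreach ($results as $url => $ok) {
    echo ($ok ? 'Valid url: ' : 'Invalid url: ') . $url . PHP_EOL;
}
```

Note that because of the $ anchor, a URL with a query string after the extension (for example a.jpg?v=1) would not match this pattern.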

If you want to run it against the raw HTML text instead (held in $txt below), you can use

$regex = '#\bhref=(["\']?)((?:http|//)[^"\']*\.(?:jpe?g|png|gif))\1#iu';
if (preg_match_all($regex, $txt, $matches)) {
    print_r($matches[2]);
}


Details

  • \b - word boundary
  • href= - literal text
  • (["\']?) - Group 1: an optional " or ' (empty if the attribute value is unquoted)
  • ((?:http|//)[^"\']*\.(?:jpe?g|png|gif)) - Group 2:
    • (?:http|//) - http or //
    • [^"\']* - 0+ chars other than ' and "
    • \. - a .
    • (?:jpe?g|png|gif) - extension string
  • \1 - same value as in Group 1, either " or ' or empty.
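As a rough end-to-end sketch of the HTML-text variant, the snippet below fills $txt with some made-up markup (the cdn.example.test host and file names are assumptions, not from the question) and collects Group 2 from every match.

```php
<?php
// Pattern from this answer: href=, an optional quote (Group 1),
// the URL itself (Group 2), then the same quote again (\1).
$regex = '#\bhref=(["\']?)((?:http|//)[^"\']*\.(?:jpe?g|png|gif))\1#iu';

// Hypothetical HTML used only for illustration.
$txt = <<<HTML
<a href="https://cdn.example.test/one.jpg">one</a>
<a href='//cdn.example.test/two.png'>two</a>
<a href="three.gif">three</a>
<a href="https://cdn.example.test/four.bak">four</a>
HTML;

$urls = [];
if (preg_match_all($regex, $txt, $matches)) {
    // $matches[2] holds the Group 2 capture of every match, i.e. the URLs
    $urls = $matches[2];
}

print_r($urls);
```

Only the first two links should end up in $urls: the third has no http or // prefix and the fourth does not end in an image extension.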

Upvotes: 1

Michał Turczyn

Reputation: 37367

I'd suggest this pattern: href="((?:http|\/\/)[^"]+\.(?:jpe?g|png|gif))"

Explanation:

href=" - match href=" literally; this ensures that you match a hyperlink attribute

(...) - capturing group to store actual link

(?:...) - non-capturing group

http|\/\/ - match http or //

[^"]+ - match 1+ of any characters other than "

\. - match . literally

jpe?g|png|gif - alternation, match one of the options jpeg, jpg (due to e?), png, gif

" - match " literally


The matched link will be in the first capturing group.
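A possible way to use this pattern from PHP is sketched below. The / delimiters, the $html variable and the sample markup (cdn.example.test host included) are assumptions added for illustration; no flags are set, so matching is case-sensitive here.

```php
<?php
// The pattern from this answer, wrapped in / delimiters (an assumption;
// the answer gives the pattern without delimiters or flags).
$regex = '/href="((?:http|\/\/)[^"]+\.(?:jpe?g|png|gif))"/';

// Hypothetical HTML used only for illustration.
$html = <<<HTML
<a href="https://cdn.example.test/photo.jpeg">photo</a>
<a href="//cdn.example.test/icon.gif">icon</a>
<a href="local.png">local</a>
HTML;

// $matches[1] collects the first capturing group of every match
preg_match_all($regex, $html, $matches);
$links = $matches[1];

print_r($links);
```

Unlike the quote-agnostic pattern in the other answer, this one only recognises href values wrapped in double quotes.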

Upvotes: 1
