Reputation: 331
So I'm working on a regexp to catch all links in a string, meaning words that start with a protocol like http, https etc., words that start with www., or words that end in some specific domains: ".com", ".hr" and ".net". But somehow the regexp I made always returns all the links that start with a protocol, but only the last one of those that end in a specific domain. What am I doing wrong :|? Many thanks!
$description='test.com test2.hr http://www.test3.hr https://test4.com test3.net';
$pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]|(?:\b((?:[\w]+\.com$)|(?:[\w]+\.hr$)|(?:[\w]+\.net$)))/i';
preg_match_all($pattern, $description, $out);
var_dump($out[0]);
Upvotes: 0
Views: 71
Reputation: 700
There are a few problems with your original regex. First, the protocol part should be made optional with the ? quantifier, so that bare domains can go through the same pattern. I'm not sure why you're using the second block of [A-Z0-9+&@#\/%=~_|$] or why you're using the | operator after that; if there's a specific reason, please let me know. Finally, $ is an anchor: wherever it appears in the pattern, it only matches at the end of the subject string (or just before a trailing newline). That means your .com, .hr and .net alternatives can only match the very last word of the string, which is why you get only the last one back. \Z behaves almost the same way, so it wouldn't help here; I don't think you want to be matching end-of-string at all. I've rewritten the regex below in the way I think you want it to work (note that it accepts any alphabetic suffix after the last dot, not only .com, .hr and .net):
$description='test.com test2.hr http://www.test3.hr https://test4.com test3.net trash string don\'t match test4.net';
$pattern = '/(?:(?:https?|ftp|file):\/\/(?:www|ftp)\.)?[-A-Z0-9+&@#\/%=~_|$?!:,.]*(\.[A-Z]+)/i';
preg_match_all($pattern, $description, $out);
var_dump($out[0]);
returns:
array(6) {
[0]=>
string(8) "test.com"
[1]=>
string(8) "test2.hr"
[2]=>
string(19) "http://www.test3.hr"
[3]=>
string(17) "https://test4.com"
[4]=>
string(9) "test3.net"
[5]=>
string(9) "test4.net"
}
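If you'd rather keep your original protocol branch and accept only the three domains from the question, here is a minimal sketch of that variant. It assumes the only change you need is swapping the trailing $ for a word boundary \b; the names $anchored and $anchoredOut are just illustrative, and I haven't tried it beyond this sample string.

```php
<?php
$description = 'test.com test2.hr http://www.test3.hr https://test4.com test3.net';

// Bare-domain branch as in the question (condensed): the trailing $ ties the
// match to the end of the subject string, so only the last word can match.
$anchored = '/\b\w+\.(?:com|hr|net)$/i';
preg_match_all($anchored, $description, $anchoredOut);

// Same idea with \b instead of $: the domain may now end anywhere a word ends.
// The protocol/www branch is copied unchanged from the question.
$pattern = '/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]'
         . '|\b\w+\.(?:com|hr|net)\b/i';
preg_match_all($pattern, $description, $out);

var_dump($anchoredOut[0]); // only "test3.net"
var_dump($out[0]);         // all five links, in order of appearance
```

Unlike the pattern above, this one ignores words such as example.org, which may or may not be what you want.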
Upvotes: 1