Reputation: 10413
I have searched many Stack Overflow regular expression posts but couldn't find an answer.
I am using the following pattern to find all URLs in a given $text string:
$pattern = "#((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?#";
(Agreed, there might be more precise or efficient patterns, but that is not the problem... yet.)
Now with this text input:
$text = "Website: www.example.com, ";
$text .= "Contact us: http://www.example.com/cu?t=contactus#anchor, ";
$text .= "Email: [email protected]";
Then a
preg_match_all($pattern, $text, $matches);
would return these:
www.example.com
http://www.example.com/cu?t=contactus
example.com
The last match, example.com, comes from the email address, and I want to be able to exclude it.
I tried many combinations of [^@], (?!@) and so on, to no avail: I am still getting the email results.
The best I could do is to include an optional @ at the beginning of the pattern, so that it returns @example.com, and then loop over my results to exclude the ones starting with @.
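For reference, here is a minimal sketch of that workaround. The address info@example.com is a placeholder chosen for illustration, and $urls is just a name picked for the filtered result.

```php
<?php
// Same pattern as above, with an optional leading "@" so that matches
// coming from email addresses can be recognised and dropped afterwards.
$pattern = '#@?((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?#';

// info@example.com is a placeholder address used for illustration only.
$text  = "Website: www.example.com, ";
$text .= "Contact us: http://www.example.com/cu?t=contactus#anchor, ";
$text .= "Email: info@example.com";

preg_match_all($pattern, $text, $matches);

// Drop every match that starts with "@", then reindex the array.
$urls = array_values(array_filter($matches[0], function ($m) {
    return $m[0] !== '@';
}));

print_r($urls);
```

It works, but it needs a second pass over the matches, which is what I would like to avoid.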
Is there any better solution? A single pattern that would not include the sub-strings that are emails?
Upvotes: 2
Views: 2983
Reputation: 4092
An example solution that avoids more advanced features such as assertions:
<?php
$text = 'ftp://web.com, ';
$text .= "Website: www.example.com, ";
$text .= "Contact us: http://www.example.com/cu?t=contactus#anchor, ";
$text .= "Email: [email protected]";
$base = "((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?";
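// First attempt: the bare pattern also matches the domain of the email address.
$matches = array();
preg_match_all("#$base#", $text, $matches);
var_dump($matches[0]);
// Second attempt: require whitespace before each URL and capture only the URL (group 1).
// A space is prepended to $text so that a URL at the very start still matches.
$matches = array();
preg_match_all("#\s($base)#", " $text", $matches);
var_dump($matches[1]);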
?>
Output:
array(4) {
[0]=>
string(13) "ftp://web.com"
[1]=>
string(15) "www.example.com"
[2]=>
string(37) "http://www.example.com/cu?t=contactus"
[3]=>
string(11) "example.com"
}
array(3) {
[0]=>
string(13) "ftp://web.com"
[1]=>
string(15) "www.example.com"
[2]=>
string(37) "http://www.example.com/cu?t=contactus"
}
Simply check for whitespace before the URL, but do not include it in the captured subpattern. Using [^@] won't work because the regex would simply match e as [^@] and xample.com as the rest of the pattern, and both parts would end up in a single match.
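To make that concrete, the sketch below runs the [^@] attempt and, for comparison, a variant with a negative lookbehind, which is one of the assertion features this answer otherwise avoids. The shortened input, the placeholder address info@example.com and the variable names $withClass and $withLookbehind are assumptions made for illustration.

```php
<?php
// The URL pattern from the answer, kept in a single-quoted string.
$base = '((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?';

// info@example.com is a placeholder address used for illustration only.
$text = 'Website: www.example.com, Email: info@example.com';

// Attempt with [^@]: the class consumes one character, so inside the email
// address it takes "e" and the rest of the pattern takes "xample.com".
// It also swallows the space in front of the first URL.
preg_match_all("#[^@]$base#", $text, $withClass);

// Alternative with a negative lookbehind: a match may not start directly
// after "@" or after another host-name character. Checking only for "@"
// would not be enough, because the match could then start at "xample.com".
preg_match_all("#(?<![@a-zA-Z0-9.\-])$base#", $text, $withLookbehind);

var_dump($withClass[0]);
var_dump($withLookbehind[0]);
```

With this input, the [^@] version still reports example.com, while the lookbehind version reports only www.example.com. Unlike the whitespace check, the lookbehind also matches a URL at the very start of the string without prepending a space.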
Upvotes: 1