JScoobyCed

Reputation: 10413

preg_match_all to find all URLs but exclude emails

I have searched many Stack Overflow regular expression posts but couldn't find an answer. I am using the following pattern to find all URLs in a given $text string:

$pattern = "#((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?#";

(Agreed, there might be a more precise or efficient pattern, but that is not the problem... yet.)

Now with this text input:

$text = "Website: www.example.com, ";
$text .= "Contact us: http://www.example.com/cu?t=contactus#anchor, ";
$text .= "Email: user@example.com";

Then a

preg_match_all($pattern, $text, $matches);

would return these:

www.example.com
http://www.example.com/cu?t=contactus
example.com

The last match, example.com, comes from the email address, and I want to exclude it.
I tried many combinations of [^@], (?!@) ... to no avail; I still get the email's domain in the results.

The best I could do is to add an optional @ at the beginning of the pattern so that it returns @example.com, and then loop over the results to exclude the ones starting with @.
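A minimal sketch of that workaround, reusing the pattern from above with an optional leading @. The address user@example.com is a placeholder (an assumption, since the original address is not shown), and $candidates and $urls are illustrative names.

```php
<?php
// Same URL pattern as above, with an optional "@" in front so that the
// domain part of an email address is matched as "@example.com".
$base = '((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?';
$pattern = '#@?' . $base . '#';

$text = 'Website: www.example.com, ';
$text .= 'Contact us: http://www.example.com/cu?t=contactus#anchor, ';
$text .= 'Email: user@example.com'; // placeholder address (assumption)

preg_match_all($pattern, $text, $matches);
$candidates = $matches[0];

// Drop every match that starts with "@": those came from email addresses.
$urls = array_values(array_filter($candidates, function ($m) {
    return $m[0] !== '@';
}));

print_r($urls);
```

It works, but it needs a second pass over the results.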

Is there a better solution, ideally a single pattern that does not match the sub-strings that belong to emails?

Upvotes: 2

Views: 2983

Answers (1)

John

Reputation: 4092

An example solution that avoids more advanced features such as assertions:

<?php

$text = 'ftp://web.com, ';
$text .= "Website: www.example.com, ";
$text .= "Contact us: http://www.example.com/cu?t=contactus#anchor, ";
$text .= "Email: user@example.com";

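// Base URL pattern, shared by both calls below.
$base = "((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?";

// 1) Plain match: also picks up "example.com" from the email address.
$matches = array();
preg_match_all("#$base#", $text, $matches);
var_dump($matches[0]);

// 2) Require whitespace before the URL and capture only the URL itself.
//    A space is prepended to $text so a URL at the very start still matches.
$matches = array();
preg_match_all("#\s($base)#", " $text", $matches);
var_dump($matches[1]);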

?>

Output:

array(4) {
  [0]=>
  string(13) "ftp://web.com"
  [1]=>
  string(15) "www.example.com"
  [2]=>
  string(37) "http://www.example.com/cu?t=contactus"
  [3]=>
  string(11) "example.com"
}
array(3) {
  [0]=>
  string(13) "ftp://web.com"
  [1]=>
  string(15) "www.example.com"
  [2]=>
  string(37) "http://www.example.com/cu?t=contactus"
}

Simply require whitespace before the URL, but keep that whitespace outside the capturing subpattern. Using [^@] won't work, because the regex would simply match e as [^@] and xample.com as the rest of the pattern, so the overall match would still be example.com.
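If assertions are acceptable after all, a single pattern can do the same job. The sketch below is one possible variant, not part of the code above: it puts a negative lookbehind in front of the same base pattern so that a match may not start right after @, a letter, a digit, a dot, or a hyphen. A plain (?<!@) is not enough, for the same reason [^@] fails: the engine would just start one character later and return xample.com. The address user@example.com is a placeholder (an assumption), and $naive, $strict, and $urls are illustrative names.

```php
<?php
$base = '((http|https|ftp|ftps)://)?([a-zA-Z0-9\-]*\.)+[a-zA-Z0-9]{2,4}(/[a-zA-Z0-9=.?&-]*)?';

$text = 'ftp://web.com, ';
$text .= 'Website: www.example.com, ';
$text .= 'Contact us: http://www.example.com/cu?t=contactus#anchor, ';
$text .= 'Email: user@example.com'; // placeholder address (assumption)

// Naive lookbehind: only forbids "@" directly before the match,
// so the match simply starts one character later.
preg_match_all('#(?<!@)' . $base . '#', 'Email: user@example.com', $naive);

// Stricter lookbehind: the match may not start after "@" or in the
// middle of a word or host name.
preg_match_all('#(?<![@a-zA-Z0-9.\-])' . $base . '#', $text, $strict);
$urls = $strict[0];

var_dump($naive[0]); // contains "xample.com"
var_dump($urls);     // the three URLs, nothing from the email address
```

Unlike the whitespace check, this variant also accepts a URL that directly follows punctuation such as an opening parenthesis, which may or may not be wanted.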

Upvotes: 1
