Reputation: 95
I have a regular expression that matches words with a . between them as potential URLs, but not those with an @ in front of them, since those are assumed to be emails.
This is the regex that I have:
(?:\@(https?:\/\/)?(\w+(\-*\w+)*\.)[a-zA-Z\.]+[\w+\/?\#?\??\=\%\&\-]+.*?)*\K(https?:\/\/)?(\w+(\-*\w+)*\.)[a-zA-Z\.]+[\w+\/?\#?\??\=\%\&\-]+
It does not handle the last email address in the input correctly.
For example, for the string
twitter.com facebook.com [email protected] [email protected] [email protected] [email protected] john wayne <[email protected]> 20,000.00
I expect the matches to be twitter.com and facebook.com, but it also matches dc.com.
Upvotes: 0
Views: 309
Reputation: 10139
In your (?:\@(https?:\/\/), the ? in https?: makes the preceding s optional, so that part matches either http or https. The ? literally means 0 or 1 of the character s. The : in https?: matches a literal :, nothing special.
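As a quick check of the point above, here is a minimal sketch using Python's re module (the sample strings are made up for illustration; the original regex uses \K, which Python's re does not support, so only the https?: fragment is tested):

```python
import re

# In "https?:", the ? applies only to the preceding "s" (zero or one "s").
# The ":" is an ordinary literal character.
scheme = re.compile(r"https?:")

# Hypothetical sample inputs, not taken from the question.
samples = ["http://a.com", "https://a.com", "httpss://a.com", "ftp://a.com"]

# re.match anchors at the start of each string.
matched = [s for s in samples if scheme.match(s)]
print(matched)
```

Only the http and https samples should pass, since a second s or a different scheme leaves the literal : unmatched.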
The difference is that when ?: comes directly after a non-escaped opening parenthesis, it marks a non-capturing group.
Escaped: \(?: is not a non-capturing group (it matches an optional literal ( followed by a literal :)
Not escaped: (?: is a non-capturing group
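The escaped versus non-escaped distinction can be seen in a small sketch, again assuming Python's re module and made-up input strings:

```python
import re

text = "http://example.com"  # hypothetical input

# (?:...) groups without capturing, so only (\w+) is reported as a group.
non_capturing = re.match(r"(?:https?://)(\w+)", text)
non_capturing_groups = non_capturing.groups()

# Plain (...) captures, so the scheme shows up as group 1.
capturing = re.match(r"(https?://)(\w+)", text)
capturing_groups = capturing.groups()

# \(?: is no group at all: an optional literal "(" followed by a literal ":".
escaped_hits = re.findall(r"\(?:", "(: and :")

print(non_capturing_groups)
print(capturing_groups)
print(escaped_hits)
```

The non-capturing version keeps the grouping for alternation or quantifiers without adding an entry to the match's group list.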
The next portion of your question: what does the .*? in [\w+\/?\#?\??\=\%\&\-]+.*? refer to?
. will match any character (by default, any character except a newline)
* is a quantifier that will match your . (any character) 0 to unlimited times
*? makes * non-greedy, so it matches as few characters as possible. An internet search will turn up plenty of information on non-greedy matching if you are unfamiliar with it.
Upvotes: 4
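To illustrate the greedy versus non-greedy behavior described in the answer, here is a minimal sketch in Python's re module (the input strings are invented for the example):

```python
import re

text = "<a><b>"  # hypothetical input

# Greedy: .* grabs as much as it can, then backs off to the last ">".
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? grabs as little as it can, stopping at the first ">".
lazy = re.search(r"<.*?>", text).group()

# At the very end of a pattern, .*? has nothing forcing it to expand,
# so it matches the empty string.
trailing = re.search(r"a.*?", "abc").group()

print(greedy, lazy, trailing)
```

The last case may be worth keeping in mind when reading the question's regex, where .*? sits at the end of a repeated group: a lazy quantifier consumes text only when whatever follows it requires that.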