Reputation: 186
I'm looking for a regex to match entire URLs that are NOT from .com, .net, .org, .info, .edu, .gov, or .ca domains. The TLD list may grow over time, but it's a good start.
These would match:
These would NOT match:
For a little background, I'm looking to use the expression with Exchange Online to filter inbound email containing unusual/international links, which in our case are almost 100% phishing or spam. We're a small business that only services local customers, and generally all of our vendors are North American.
Upvotes: 1
Views: 2232
Reputation: 5354
Tricky one. Here it is:
/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi
Works for all the use cases listed below:
https://www.organization.org/
https://some-server.example.ca/home
https://www.complete.com/index.php
https://www.example.com
https://www.example.com/?url=junk.xyz
https://www.freddy.dana.comealong.com/
http://www.example.info:8181
https://some_server.example.ca/home.html
https://www.complete.com/index.php
https://www.organization.org/
https://www.spammy.spammer.comealong.cop/
https://spam.caught.cat/home/away/now/index.htm
https://www.complete.xyz/index.php?com=seww
https://www.example.abc/?spam=yes&spammer=yep&from=me.com
https://www.example.ru/spammy/spammer/index.php
https://www.com_server.ru/?url=beep.gov&para=HaHaGotYou
https://www.example.ru
https://www.example.ru/home.html.com
http://www.example.xyz/index.php
https://some-server.example.co.uk/home
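The top group (the first ten URLs, through the second https://www.organization.org/) is not matched; the bottom group (the last ten, starting with https://www.spammy.spammer.comealong.cop/) is matched in full, so you can send those off to /dev/cornfield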
In the bottom group, please notice that there are URL parameters that DO have .com in them, but my assumption is you want to blow those away as well, so the regex is very narrow in defining how and where the TLD appears.
And there are URLs like www.complete.xyz or example.abc/spam.com, which obviously should be selected. Details below:
Here's a link to a regex pen: https://regexr.com/69vce
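For anyone who wants to reproduce the check outside the regex pen, here is a minimal Python sketch. It assumes Python's re module behaves like the engine in the pen for this pattern, uses only a subset of the URLs listed above, and puts one URL per line with a trailing newline, as in the pen. The names UNLISTED_TLD, should_not_match and should_match are just illustrative.

```python
import re

# The pattern from this answer; the JavaScript-style /.../gi flags become
# re.IGNORECASE plus findall() in Python.
UNLISTED_TLD = re.compile(
    r"https?://(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*",
    re.IGNORECASE,
)

# Subset of the "top group": listed TLDs, should be left alone.
should_not_match = [
    "https://www.organization.org/",
    "https://some-server.example.ca/home",
    "https://www.example.com",
    "https://www.example.com/?url=junk.xyz",
    "https://www.freddy.dana.comealong.com/",
    "http://www.example.info:8181",
]

# Subset of the "bottom group": unlisted TLDs, should be matched in full.
should_match = [
    "https://www.spammy.spammer.comealong.cop/",
    "https://spam.caught.cat/home/away/now/index.htm",
    "https://www.complete.xyz/index.php?com=seww",
    "https://www.example.abc/?spam=yes&spammer=yep&from=me.com",
    "https://www.example.ru/home.html.com",
    "https://some-server.example.co.uk/home",
]

# One URL per line, each followed by a newline.
text = "\n".join(should_not_match + should_match) + "\n"
found = UNLISTED_TLD.findall(text)

for url in found:
    print(url)
```

Only the URLs from the second list should be printed, each one complete including its path and query string.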
/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi
It starts with the obvious https?:\/\/ but then goes immediately into a negative lookahead (?!. The URLs we leave alone must be identified by the TLD as actually used in the host name only: a .com in a parameter must not count, and a name like www.complete.abc must not slip through.
So the first part of the negative lookahead is [\w.-]+\. which means we only evaluate letters, numbers, - and . (the characters in the brackets), with + for one or more, as there may be several before the TLD, then one single mandatory escaped period \. which is how we "lock in" the TLD.
Note 1: inside the brackets the period does not need to be escaped; within [.] it is a literal period, not a wildcard.
Note 2: \w includes the underscore _ which is not a legal domain character, but we disregard that as we do not need to validate the domain names as presented.
Next comes a non-capture group with an ORed | list of the TLDs to NOT match, then [^\w.-] which is how we still block names like www.complete.xyz. It rejects the TLD letters IF they are followed by any legal domain name character: letter, number, period or hyphen. Notice, by the way, that the hyphen - is LAST in the class, because if it were in the middle instead, say [^\w-.], it could be an error in some implementations of regex, as the hyphen is otherwise used for ranges such as a-z.
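One thing worth checking about the trailing [^\w.-]: it needs an actual character after the TLD, so a listed-TLD URL sitting at the very end of the input, with nothing after it, is not rejected. The Python sketch below illustrates that. The variant using a nested (?![\w.-]) is my own assumption about how one might cover that case; it is not part of the expression above, and whether Exchange Online accepts it is untested.

```python
import re

TLDS = r"(?:com|edu|gov|ca|net|org|info)"

# Expression as given in the answer: the TLD must be followed by some
# character that is not a letter, digit, underscore, period or hyphen.
original = re.compile(r"https?://(?![\w.-]+\." + TLDS + r"[^\w.-])\S*", re.I)

# Hypothetical variant: the TLD must simply NOT be followed by a
# host-name character, which is also true at the end of the input.
variant = re.compile(r"https?://(?![\w.-]+\." + TLDS + r"(?![\w.-]))\S*", re.I)

bare = "https://www.example.com"  # nothing after the TLD

# True means "flagged as an unlisted-TLD URL".
print("original, end of input:", original.search(bare) is not None)
print("original, newline after:", original.search(bare + "\n") is not None)
print("variant, end of input:", variant.search(bare) is not None)
print("variant, unlisted TLD:", variant.search("https://www.complete.xyz") is not None)
```

In a mail body the URL is normally followed by a newline, a quote or a slash, so the original form is usually fine; the difference only shows when the URL is the last thing in the text being scanned.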
Then finally the \S* means match everything up to the next whitespace character. The lookahead consumes nothing, so if it did not reject the URL, matching simply continues from just after the http:// and takes the entire rest of the URL.
Now, this is potentially a little broad, but since I assume you are just trashing them that should be fine. If you were selecting them for further use, then you may want to use something more selective like [\w.:%&?~=/-]* instead. This includes period, colon for the port, = & ? for params, % for URL escaping, etc. And again the hyphen is last.
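To see the difference between the two tails, here is a small Python sketch. The HTML snippet is a made-up example, and the names broad and narrow are only for illustration.

```python
import re

# Shared prefix: scheme plus the negative lookahead from the answer.
PREFIX = r"https?://(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])"

broad = re.compile(PREFIX + r"\S*", re.IGNORECASE)
narrow = re.compile(PREFIX + r"[\w.:%&?~=/-]*", re.IGNORECASE)

# Hypothetical fragment of an HTML mail body.
text = 'Click <a href="https://www.example.ru/login?id=7">here</a>'

# \S* runs on until the next whitespace, so it swallows the markup too.
print(broad.search(text).group(0))

# The narrower class stops at the closing quote.
print(narrow.search(text).group(0))
```

For simply detecting and discarding a message either form works; the narrower class only matters when the matched text itself is used afterwards.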
And of course at the very end, global and case insensitive /gi
Upvotes: 2
Reputation: 42021
To match the entire URL...
Note this implementation attempts to cover additional elements based on the usage for matching unusual URLs: other schemes (e.g. ftp, ldap) and port numbers (e.g. https://www.example.com:8080/).
I don't know the exact regular expression engine used by "Exchange Online", so here I'm using RegEx features of C# and PowerShell, assuming those will be available.
[a-z][a-z0-9+.-]*://(?>(?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?(?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\]))(?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)[a-z0-9!#$%&'()*+,./:;=?@_~-]*
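Scheme (http/https/ftp etc): [a-z][a-z0-9+\-.]*
Atomic group: (?>
Username / password: (?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?
Hostname: (?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\]))
Host name or IPv4 address: [a-z0-9%._~-]+
Bracketed IP literal: \[[a-z0-9!$%&'()*+,.:;=_~-]+\]
Negative lookbehind rejecting the listed TLDs: (?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)
Optional port (inside the lookbehind): (?::\d+)?
Path, Query String and Fragment: [a-z0-9!#$%&'()*+,./:;=?@_~-]*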
The atomic group is to prevent the "Username / password" and "Query String and Fragment" part of the expression matching as the "Hostname" part of the string without our validations.
If you are using this regular expression to match URLs in a text document, you might find some issues with "quoted" URLs or markdown links.
E.g.
[an example](http://example.cox/)
'http://www.example.cox/'
http://www.example.cox/index.html, something interesting in a sentence
You can get it here http://www.example.cox/download.html.
This RegEx as-is would match additional characters at the end because they are valid URL characters i.e.:
http://example.cox/)
http://www.example.cox/'
http://www.example.cox/index.html,
http://www.example.cox/download.html.
To avoid this you can repeat RegEx above in a pattern like this (obviously you would remove the whitespace / new lines):
(?:
(?<=['])
# RegEx here
(?=['])
|
(?<=["])
# RegEx here
(?=["])
|
(?<=\()
# RegEx here
(?=\))
|
# RegEx here
(?<![.,])
)
So here we are saying that if there is a quote ' / " or bracket ( before the URL, assume the matching one at the end of the URL can be ignored.
Where the match didn't have a bracket (, quote ' etc. at the start, the last part (?<![.,]) basically says don't match a trailing full stop . or comma , at the end of the URL, even though they are perfectly valid URL characters. I'm doing this in the full knowledge that it might break the returned URL.
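The wrapping idea can be tried in isolation with a short Python sketch. Note the assumptions: URL below is a simplified stand-in (scheme plus URL characters, with no TLD check), because Python's built-in re module does not accept the variable-length lookbehind used in the full expression. The names URL, alternatives and WRAPPED are illustrative only.

```python
import re

# Simplified stand-in for "RegEx here": any scheme, "://", then URL characters.
URL = r"[a-z][a-z0-9+.-]*://[a-z0-9!#$%&'()*+,./:;=?@_~-]*"

alternatives = [
    r"(?<=')" + URL + r"(?=')",    # 'quoted'
    r'(?<=")' + URL + r'(?=")',    # "quoted"
    r"(?<=\()" + URL + r"(?=\))",  # (bracketed), e.g. a markdown link
    URL + r"(?<![.,])",            # bare: drop a trailing . or ,
]
WRAPPED = re.compile("(?:" + "|".join(alternatives) + ")", re.IGNORECASE)

samples = [
    "[an example](http://example.cox/)",
    "'http://www.example.cox/'",
    "http://www.example.cox/index.html, something interesting in a sentence",
    "You can get it here http://www.example.cox/download.html.",
]

cleaned = [WRAPPED.search(sample).group(0) for sample in samples]
for url in cleaned:
    print(url)
```

Each alternative relies on backtracking: the URL part first grabs the closing quote, bracket or punctuation, fails the trailing assertion, and then gives that one character back.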
Upvotes: 1
Reputation: 582
\/\/.*.(com|ca|info|org|info)(\/|$)
This should work
.*.(com|ca|info|org|info) This part will look for the entire URL starting from // until the last part of the TLD, i.e. till the next / or the end of the line.
You can add more TLDs inside (org|info...) in a similar manner.
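Note that, as written, this expression matches URLs that DO contain one of the listed TLDs, so for the filtering described in the question the result would have to be inverted in code. A rough Python sketch of that idea follows; is_unlisted is a hypothetical helper name, and the pattern is copied unchanged apart from dropping the escaped slashes that Python does not need.

```python
import re

# Pattern from this answer, unchanged in meaning (note that the "." before
# the group is unescaped, so it matches any character, and .* is greedy).
LISTED = re.compile(r"//.*.(com|ca|info|org|info)(/|$)", re.IGNORECASE)

def is_unlisted(url: str) -> bool:
    """Hypothetical helper: True when the pattern does NOT find a listed TLD."""
    return LISTED.search(url) is None

print(is_unlisted("https://www.complete.com/index.php"))    # listed TLD
print(is_unlisted("https://www.example.ru/spammy"))         # unlisted TLD
print(is_unlisted("https://www.example.ru/home.html.com"))  # ".com" only in the path
```

The last case shows the limitation of the greedy .*: a .com at the end of the path is enough for the URL to be treated as listed, even though the host is .ru.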
https://regex101.com/r/LC1FLQ/1
Upvotes: 1