I'm looking for a regex to match entire URLs that are NOT from .com, .net, .org, .info, .edu, .gov, or .ca domains. The TLD list may grow over time, but it's a good start.

These would match:

https://www.example.ru
http://www.example.xyz/index.php
https://someserver.example.co.uk/home

These would NOT match:

https://www.example.com
http://www.example.info/index.php
https://someserver.example.ca/home

For a little background, I'm looking to use the expression with Exchange Online to filter inbound email containing unusual/international links, which in our case are almost 100% phishing or spam. We're a small business that only services local customers, and generally all of our vendors are North American.

Answers (3)

Reputation: 5354

Answer

Tricky one. Here it is:

/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi

Works for all the use cases listed below:

These would NOT match:

https://www.organization.org/    
https://some-server.example.ca/home    
https://www.complete.com/index.php   
https://www.example.com   
https://www.example.com/?url=junk.xyz   
https://www.freddy.dana.comealong.com/   
http://www.example.info:8181 
https://some_server.example.ca/home.html   

These would match:

https://www.spammy.spammer.comealong.cop/    
https://spam.caught.cat/home/away/now/index.htm    
https://www.complete.xyz/index.php?com=seww    
https://www.example.abc/?spam=yes&spammer=yep&from=me.com    
https://www.example.ru/spammy/spammer/index.php    
https://www.com_server.ru/?url=beep.gov&para=HaHaGotYou    
https://www.example.ru    
https://www.example.ru/home.html.com    
http://www.example.xyz/index.php    
https://some-server.example.co.uk/home

The top group is not matched; everything in the bottom group matches, so you can send those off to /dev/cornfield.

In the bottom group, please notice that there are URL parameters that DO have .com in them, but my assumption is you want to blow those away as well, so the regex is very narrow in defining how and where the TLD appears.

And there are URLs like www.complete.xyz or example.abc/spam.com, which obviously should be selected. Details below:

Here's a link to a regex pen: https://regexr.com/69vce
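For anyone who wants to check the two lists outside of regexr, here is a minimal Python sketch. It assumes Python's re module treats this pattern the same way the target engine does, which may not hold for Exchange Online. The names PATTERN, SHOULD_MATCH and SHOULD_NOT_MATCH are mine, and the /.../gi delimiters and \/ escapes are dropped because Python does not need them.

```python
import re

# Same expression as above, written as a Python raw string.
PATTERN = re.compile(
    r"https?://(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*",
    re.IGNORECASE,
)

SHOULD_NOT_MATCH = [
    "https://www.organization.org/",
    "https://some-server.example.ca/home",
    "https://www.complete.com/index.php",
    "https://www.example.com",
    "https://www.example.com/?url=junk.xyz",
    "https://www.freddy.dana.comealong.com/",
    "http://www.example.info:8181",
    "https://some_server.example.ca/home.html",
]

SHOULD_MATCH = [
    "https://www.spammy.spammer.comealong.cop/",
    "https://spam.caught.cat/home/away/now/index.htm",
    "https://www.complete.xyz/index.php?com=seww",
    "https://www.example.abc/?spam=yes&spammer=yep&from=me.com",
    "https://www.example.ru/spammy/spammer/index.php",
    "https://www.com_server.ru/?url=beep.gov&para=HaHaGotYou",
    "https://www.example.ru",
    "https://www.example.ru/home.html.com",
    "http://www.example.xyz/index.php",
    "https://some-server.example.co.uk/home",
]

# One URL per line, each followed by a newline, as in the regex pen.
text = "\n".join(SHOULD_NOT_MATCH + SHOULD_MATCH) + "\n"
matched = PATTERN.findall(text)

for url in matched:
    print("caught:", url)
```

Only the URLs from the second list should be printed, in the order they appear.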

Tutorial:

/https?:\/\/(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])\S*/gi

It starts with the obvious https?:\/\/ and then goes immediately into a negative lookahead, (?!

The URLs we leave alone are decided only by the TLD actually in use on the host: a .com inside a parameter must not count, and a name like www.complete.abc, where "com" is only part of a label, must not be skipped.

So the first part of the negative lookahead is [\w.-]+\. which means we only evaluate letters, numbers, hyphens and periods (the bracketed class, with + for one or more, since there may be several before the TLD), followed by one mandatory escaped period \. which is how we "lock in" the TLD.

Note 1: inside the brackets the period does not need to be escaped; within [.] it is a literal period, not a wildcard.

Note 2: \w includes the underscore _, which is not a legal domain character, but we disregard that since we do not need to validate the domain names as presented.
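The two notes can be checked quickly in Python. This is only a sketch against Python's re module; HOST_CHARS is a name I made up for the bracketed class.

```python
import re

# Note 1: inside a character class the period is a literal period.
assert re.fullmatch(r"[.]", ".") is not None
assert re.fullmatch(r"[.]", "a") is None
# Outside the brackets an unescaped period is a wildcard.
assert re.fullmatch(r".", "a") is not None

# Note 2: \w also accepts the underscore, so a host with "_" still passes.
HOST_CHARS = re.compile(r"[\w.-]+")
print(HOST_CHARS.fullmatch("some_server.example.ca") is not None)
```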

Next comes a non-capturing group with an ORed | list of the TLDs to NOT match. Then [^\w.-] is how we still catch names like www.complete.xyz: the listed TLD letters do not count IF they are followed by any legal domain name character (letter, number, period, hyphen). Notice, by the way, that the hyphen - is LAST in the class, because if it were in the middle instead, say [^\w-.], it would be an error in some implementations of regex, as the hyphen is otherwise used for ranges such as a-z.
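One consequence of the [^\w.-] guard is worth seeing in code: it has to consume a character, so as far as I can tell an allowed URL sitting at the very end of the input, with nothing after the TLD, is not recognised as allowed. The sketch below shows this in Python. VARIANT is a hypothetical alternative of mine, not part of the answer, that swaps the guard for a nested negative lookahead.

```python
import re

TLDS = "com|edu|gov|ca|net|org|info"

# Guard as written: some non-domain character must follow the TLD.
ORIGINAL = re.compile(
    rf"https?://(?![\w.-]+\.(?:{TLDS})[^\w.-])\S*", re.IGNORECASE
)

# Hypothetical variant: "no domain character follows the TLD",
# which is also true at the end of the input.
VARIANT = re.compile(
    rf"https?://(?![\w.-]+\.(?:{TLDS})(?![\w.-]))\S*", re.IGNORECASE
)

# The guard keeps ".com" inside "complete" from counting as the TLD.
assert ORIGINAL.search("https://www.complete.xyz/") is not None
assert VARIANT.search("https://www.complete.xyz/") is not None

# An allowed URL followed by "/" is left alone by both.
assert ORIGINAL.search("https://www.example.com/") is None
assert VARIANT.search("https://www.example.com/") is None

# At end of input nothing follows ".com", so the original guard cannot match
# and the URL is caught; the variant still leaves it alone.
end_of_input = "https://www.example.com"
print(ORIGINAL.search(end_of_input) is not None)
print(VARIANT.search(end_of_input) is not None)
```

In the regex pen every URL is followed by a newline, which is probably why this does not show up there.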

Then finally \S* means match everything up to the next whitespace character. So if the negative lookahead did not reject the URL, we carry on from just after the http:// (a lookahead consumes nothing) and take the entire rest of the URL.

Now, this is potentially a little broad, but since I assume you are just trashing them that should be fine. If you were selecting them for further use, then you may want to use something more selective like [\w.:%&?~=/-]* instead. This includes period, colon for port, = & ? for params, % for URL escaping, etc. And again the hyphen is last.
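To illustrate the difference between the two tails, here is a small Python sketch. The sample sentence is invented, and GUARD, BROAD and NARROW are just names for this example.

```python
import re

GUARD = r"(?![\w.-]+\.(?:com|edu|gov|ca|net|org|info)[^\w.-])"

# \S* runs to the next whitespace, so trailing punctuation comes along.
BROAD = re.compile(r"https?://" + GUARD + r"\S*", re.IGNORECASE)

# The narrower class stops at the first character it does not list.
NARROW = re.compile(r"https?://" + GUARD + r"[\w.:%&?~=/-]*", re.IGNORECASE)

text = "Click <https://www.example.ru/home?x=1&y=2>, thanks"

broad_hit = BROAD.search(text).group(0)
narrow_hit = NARROW.search(text).group(0)

print(broad_hit)   # https://www.example.ru/home?x=1&y=2>,
print(narrow_hit)  # https://www.example.ru/home?x=1&y=2
```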

And of course at the very end, global and case insensitive /gi

Upvotes: 2

Dean Taylor

Reputation: 42021

To match the entire URL...

Note this implementation attempts to cover additional elements based on the usage for matching unusual URLs:

Any scheme, for possible unknown security vectors (e.g. ftp, ldap)
Containing basic auth username and passwords
IPv6 IP addresses
Port numbers specified (e.g. https://www.example.com:8080/)
No path i.e. just a hostname / IP address
Query string
Fragment

I don't know the exact regular expression engine used by "Exchange Online", so here I'm using RegEx features of C# and PowerShell assuming those will be available.

Regular Expression

[a-z][a-z0-9+.-]*://(?>(?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?(?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\]))(?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)[a-z0-9!#$%&'()*+,./:;=?@_~-]*

Breakdown

Scheme (http/https/ftp etc.): [a-z][a-z0-9+.-]* followed by a literal ://
Atomic group start: (?>
Username / password: (?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?
Hostname: (?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\])
- IPv4 or usual domain: [a-z0-9%._~-]+
- or IPv6: \[[a-z0-9!$%&'()*+,.:;=_~-]+\]
Atomic group end: )
Hostname (negative lookbehind): (?<!\.(?:com|net|org|info|edu|gov|ca)(?::\d+)?)
- optionally allow port numbers: (?::\d+)?
Port, path, query string and fragment: [a-z0-9!#$%&'()*+,./:;=?@_~-]*

The atomic group stops the engine from backtracking into the hostname once it has been matched, so the "Username / password" and "Query String and Fragment" parts of the expression cannot end up matching a piece of the hostname and slip past the TLD check.
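For readers who want to experiment outside .NET, here is a rough Python adaptation. It is not the expression above: Python's re rejects lookbehinds whose alternatives differ in length, and atomic groups only arrived in Python 3.11, so this sketch assumes one fixed-width lookbehind per TLD and emulates the atomic group with a lookahead plus backreference. The port part of the lookbehind is dropped, and every name in it (is_unusual, ATOMIC, and so on) is mine.

```python
import re

TLDS = ["com", "net", "org", "info", "edu", "gov", "ca"]

SCHEME = r"[a-z][a-z0-9+.-]*://"
USERINFO = r"(?:[a-z0-9!$%&'()*+,.:;=_~-]+@)?"
HOST = r"(?:[a-z0-9%._~-]+|\[[a-z0-9!$%&'()*+,.:;=_~-]+\])"
REST = r"[a-z0-9!#$%&'()*+,./:;=?@_~-]*"

# Emulated atomic group: the lookahead captures userinfo + host once, and the
# backreference consumes exactly that text, so the host cannot shrink later.
ATOMIC = rf"(?=({USERINFO}{HOST}))\1"

# One fixed-width negative lookbehind per allowed TLD.
NOT_ALLOWED_TLD = "".join(rf"(?<!\.{tld})" for tld in TLDS)

URL = re.compile(SCHEME + ATOMIC + NOT_ALLOWED_TLD + REST, re.IGNORECASE)


def is_unusual(url: str) -> bool:
    """Return True when the whole string is a URL whose host is off the list."""
    return URL.fullmatch(url) is not None


for sample in [
    "https://www.example.ru:8080/path?x=1#frag",
    "ftp://user:pw@files.example.xyz/",
    "http://[2001:db8::1]/",
    "https://www.example.com:8080/",
    "https://user@www.example.com/",
]:
    print(sample, is_unusual(sample))
```

The first three samples should come back True and the last two False. Without the atomic step the host could be shortened to www.example.co and the rest of the pattern would happily take "m/".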

Using the RegEx to match URLs in text

If you are using this regular expression to match URLs in a text document you might find some issues with "quoted" URLs or markdown links.

E.g.

[an example](http://example.cox/)
'http://www.example.cox/'
http://www.example.cox/index.html, something interesting in a sentence
You can get it here http://www.example.cox/download.html.

This RegEx as-is would match additional characters at the end because they are valid URL characters i.e.:

http://example.cox/)
http://www.example.cox/'
http://www.example.cox/index.html,
http://www.example.cox/download.html.

To avoid this you can repeat the RegEx above in a pattern like this (obviously you would remove the whitespace / new lines):

(?:
(?<=['])
# RegEx here
(?=['])
|
(?<=["])
# RegEx here
(?=["])
|
(?<=\()
# RegEx here
(?=\))
|
# RegEx here
(?<![.,])
)

So here we are saying: if there is a quote (' or ") or a bracket ( before the URL, assume the matching one at the end of the URL is not part of it.

Where the match didn't have a bracket ( or quote at the start, the last part (?<![.,]) basically says don't match a full-stop . or comma , as the final character of the URL, even though they are perfectly valid characters. I'm doing this in the full knowledge that it might break the returned URL.
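The wrapping idea can be tried in Python with a simplified stand-in for the URL expression. Note the assumption: URL below is only scheme plus URL characters, not the full expression from this answer, because Python's re cannot take that one as written. WRAPPED and samples are names for this sketch only.

```python
import re

# Simplified stand-in for "# RegEx here": a scheme followed by URL characters.
URL = r"[a-z][a-z0-9+.-]*://[a-z0-9!#$%&'()*+,./:;=?@_~-]*"

WRAPPED = re.compile(
    "(?:"
    + r"(?<=')" + URL + r"(?=')"          # 'url'
    + "|" + r'(?<=")' + URL + r'(?=")'    # "url"
    + "|" + r"(?<=\()" + URL + r"(?=\))"  # (url)
    + "|" + URL + r"(?<![.,])"            # bare url, minus a final . or ,
    + ")",
    re.IGNORECASE,
)

samples = [
    "[an example](http://example.cox/)",
    "'http://www.example.cox/'",
    "http://www.example.cox/index.html, something interesting in a sentence",
    "You can get it here http://www.example.cox/download.html.",
]

found = [WRAPPED.search(line).group(0) for line in samples]
for url in found:
    print(url)
```

Each alternative lets the greedy URL part overshoot and then backtracks one character until the closing lookahead (or the final lookbehind) is satisfied.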

Upvotes: 1

lkdhruw

Reputation: 582

\/\/.*\.(com|ca|info|org)(\/|$)

This should work

.*\.(com|ca|info|org)

This part looks for everything starting from // up to the end of the TLD, i.e. up to the next / or the end of the line. You can add more TLDs inside (org|info...) in the same manner.

https://regex101.com/r/LC1FLQ/1
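A quick Python check of this expression, using the form with the dot before the TLD escaped. ALLOWED and samples are names of mine. As far as I can tell it matches URLs that ARE on the listed TLDs, so for the question's purpose the result would have to be negated, and because .* is greedy a listed TLD at the end of a path or parameter also counts.

```python
import re

# The expression from this answer, with the dot before the TLD escaped.
ALLOWED = re.compile(r"//.*\.(com|ca|info|org)(/|$)", re.IGNORECASE)

samples = {
    "https://www.example.com": True,
    "https://www.example.ca/home": True,
    "https://www.example.ru": False,
    # Greedy .* reaches the ".com" at the end of the path.
    "https://www.example.ru/home.html.com": True,
    # ...and the ".org" at the end of a parameter.
    "https://www.example.xyz/?url=beep.org": True,
}

for url, expected in samples.items():
    hit = ALLOWED.search(url) is not None
    print(url, hit)
    assert hit == expected
```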