python regex regex-lookarounds regex-group regex-greedy

Reputation: 637

RegEx for matching specific URLs

I'm trying to write a regex in Python that will match either a URL (for example https://www.foo.com/) or a domain that starts with "sc-domain:" but does not have https or a path.

For example, the entries below should pass:

https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com

However, the entries below should fail:

htps://www.foo.com/
https:/www.foo.com/bar/
sc-domain:www.foo.com/
sc-domain:www.foo.com/bar
scdomain:www.foo.com

Right now I'm working with the below:

^(https://*/|sc-domain:^[^/]*$)

This almost works, but still allows submissions like sc-domain:www.foo.com/ to go through. Specifically, the ^[^/]*$ part doesn't capture that a '/' should not pass.

Upvotes: 1

Answers (3)

Emma

Reputation: 27723

This expression would also do that, using a few simple capturing groups that you can modify as you wish:

^(((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$

I have also added http, which you can remove if it is undesired.

JavaScript Test

const regex = /^(((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$/gm;
const str = `https://www.foo.com/
https://www.foo.com/bar/
sc-domain:www.foo.com
http://www.foo.com/
http://www.foo.com/bar/
`;
const subst = `$1`;

// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);

console.log('Substitution result: ', result);

Test with Python

You can simply test with Python and add the capturing groups that are desired:

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

# The outer group makes ^ and $ apply to both alternatives
regex = r"^(((http|https)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$"

test_str = ("https://www.foo.com/\n"
    "https://www.foo.com/bar/\n"
    "sc-domain:www.foo.com\n"
    "http://www.foo.com/\n"
    "http://www.foo.com/bar/\n\n"
    "htps://www.foo.com/\n"
    "https:/www.foo.com/bar/\n"
    "sc-domain:www.foo.com/\n"
    "sc-domain:www.foo.com/bar\n"
    "scdomain:www.foo.com")

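# Python replacement strings use \1, \2 for backreferences (not $1, $2);
# Python 3.5+ replaces groups that did not participate with an empty string
subst = r"\1 \2"

# You can limit the number of replacements with the count argument (0 means all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)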

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Edit

Based on Pushpesh's advice, you can make the s optional with ? and simplify it to:

^(((https?)(:\/\/www.foo.com)(\/.*))|(sc-domain:www.foo.com))$
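
One thing worth checking with anchored alternations: | has the lowest precedence, so in a pattern of the form ^A|B$ the ^ applies only to A and the $ only to B. The sketch below uses simplified versions of the two branches (my own reduction, for illustration only) to show the difference a surrounding group makes when the pattern is used with re.search or re.sub.

```python
import re

# Without a group, ^ anchors only the first branch and $ only the second.
ungrouped = re.compile(r"^https://www\.foo\.com/.*|sc-domain:www\.foo\.com$")

# A non-capturing group anchors both branches at both ends.
grouped = re.compile(r"^(?:https://www\.foo\.com/.*|sc-domain:www\.foo\.com)$")

# Hypothetical input with junk before the sc-domain entry.
candidate = "xx sc-domain:www.foo.com"

ungrouped_hit = ungrouped.search(candidate) is not None
grouped_hit = grouped.search(candidate) is not None

print("ungrouped:", ungrouped_hit)  # True: the second branch is not tied to the start
print("grouped:", grouped_hit)      # False: both branches must span the whole string
```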

Upvotes: 1

Pushpesh Kumar Rajwanshi

Reputation: 18357

You can use this regex:

^(?:https?://www\.foo\.com(?:/\S*)*|sc-domain:www\.foo\.com)$

Explanation:

^ - Start of string
(?: - Start of non-capturing group for the alternation
https?://www\.foo\.com(?:/\S*)* - Matches a URL starting with http:// or https://, followed by www.foo.com, optionally followed by a path using (?:/\S*)*
| - Alternation; the next branch handles strings starting with sc-domain:
sc-domain:www\.foo\.com - Matches sc-domain: followed by www.foo.com, and does not allow any path after it
)$ - Close of the non-capturing group and end of string
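
As a quick check, the pattern can be run against the examples from the question. This is a minimal sketch; the helper name is_valid is hypothetical and chosen only for illustration.

```python
import re

# Pattern from the answer, restricted to www.foo.com.
PATTERN = re.compile(
    r"^(?:https?://www\.foo\.com(?:/\S*)*|sc-domain:www\.foo\.com)$"
)

def is_valid(entry: str) -> bool:
    """Return True if entry is an accepted URL or sc-domain property."""
    return PATTERN.match(entry) is not None

# Entries taken from the question.
should_pass = [
    "https://www.foo.com/",
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
]
should_fail = [
    "htps://www.foo.com/",
    "https:/www.foo.com/bar/",
    "sc-domain:www.foo.com/",
    "sc-domain:www.foo.com/bar",
    "scdomain:www.foo.com",
]

for entry in should_pass + should_fail:
    print(f"{entry!r}: {is_valid(entry)}")
```

Note that, as written, the pattern also accepts http:// URLs and a URL with no trailing slash, which may or may not be what you want.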


Also, I'm not sure whether you wanted to allow any domain; if you do, you can use this regex:

^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$
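
A minimal sketch of how the generalized pattern behaves in Python. The sample hosts example.org and my-site.com are hypothetical inputs chosen for illustration. Since \w does not match a hyphen, hyphenated hosts are rejected by this pattern as written.

```python
import re

# Generalized pattern from the answer: any dotted host made of \w characters.
GENERIC = re.compile(
    r"^(?:https?://(?:\w+\.)+\w+(?:/\S*)*|sc-domain:(?:\w+\.)+\w+)$"
)

# Hypothetical sample inputs.
samples = [
    "https://www.foo.com/",
    "https://example.org/a/b",
    "sc-domain:example.org",
    "sc-domain:example.org/",  # trailing slash is not allowed after sc-domain
    "https://my-site.com/",    # hyphen is not matched by \w
]

results = {s: GENERIC.match(s) is not None for s in samples}
for sample, ok in results.items():
    print(f"{sample!r}: {ok}")
```

If hyphenated hosts need to pass, one option would be to replace \w+ with a character class such as [\w-]+, though that is an adjustment beyond what the answer proposes.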


Upvotes: 1

vks

Reputation: 67968

^((?:https://\S+)|(?:sc-domain:[^/\s]+))$

You can try this.

See demo.

https://regex101.com/r/xXSayK/2
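
For use in Python, a minimal sketch that filters a list of entries with this pattern. The variable names are illustrative; the entries are the ones from the question.

```python
import re

# Pattern from this answer: https:// plus any non-space text,
# or sc-domain: followed by characters that are neither '/' nor whitespace.
PATTERN = re.compile(r"^((?:https://\S+)|(?:sc-domain:[^/\s]+))$")

entries = [
    "https://www.foo.com/",
    "https://www.foo.com/bar/",
    "sc-domain:www.foo.com",
    "htps://www.foo.com/",
    "https:/www.foo.com/bar/",
    "sc-domain:www.foo.com/",
    "sc-domain:www.foo.com/bar",
    "scdomain:www.foo.com",
]

# Keep only the entries the pattern accepts.
accepted = [e for e in entries if PATTERN.match(e)]
print(accepted)
```

This pattern does not pin the host to www.foo.com and does not require a trailing slash, so any non-empty text after https:// is accepted.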

Upvotes: 4
