Regex matching on full matched substring with constraints in Python

Question

Since this is a regex question, it may well be a duplicate.

Consider the following strings:

test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla" #10
]

I want to match google.* or www.google.* but not api.google.*. In the case above, that means 2, 5, 8, and 10 should not return any match.

I have tried several regexes, but I cannot find a one-line regex that does this. Here is what I tried:

re.compile(r"((http[s]?://)?www\.google[a-z.]*)") # matches 1, 4, 7, 9
re.compile(r"((http[s]?://)?google[a-z.]*)") # matches all
re.compile(r"((http[s]?://)?.+\.google[a-z.]*)") # matches all except 0, 3, 6
re.compile(r"((http[s]?://)?!.+\.google[a-z.]*)") # matches nothing

I am looking for a way to ignore *.google.* except for www.google.* and google.*, but I got stuck trying to exclude the other *.google.* subdomains.

PS: I have found an O(n**2) way to solve this with split().

r = re.compile("^((http[s]?://)?www.google[a-z.]*)|^((http[s]?://)?google[a-z.]*)")

for s in test_str:
    for seg in s.split():
        r.findall(seg)

Wiktor Stribiżew · Accepted Answer

You may use

(?<!\S)(?:https?://)?(?:www\.)?google\.\S*

See the regex demo.

Details


(?<!\S) - a location that is preceded by whitespace or is at the start of the string (note that you may also use (?:^|\s) here, to be more explicit)
(?:https?://)? - an optional non-capturing group matching https:// or http://
(?:www\.)? - an optional non-capturing group matching www.
google\. - a google. substring
\S* - 0+ non-whitespace chars.
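The pieces listed above can also be written out with re.VERBOSE so that each part keeps its explanation next to it. This is only a minimal sketch: the (?<!\S) lookbehind is my reading of the first item's description, and the names GOOGLE_URL and first_match are made up for illustration.

```python
import re

# Hypothetical name; each line mirrors one item of the list above.
GOOGLE_URL = re.compile(
    r"""
    (?<!\S)          # assumed lookbehind: start of string or right after whitespace
    (?:https?://)?   # optional http:// or https://
    (?:www\.)?       # optional www.
    google\.         # the literal text google.
    \S*              # the rest of the token, up to the next whitespace
    """,
    re.VERBOSE,
)


def first_match(text):
    """Return the first matching token in text, or None if there is none."""
    m = GOOGLE_URL.search(text)
    return m.group() if m else None


print(first_match("bla bla www.google.com bla bla"))              # www.google.com
print(first_match("http://api.google.com"))                       # None
print(first_match("bla bla https://www.api.google.com bla bla"))  # None
```

Because the lookbehind only allows a match to start at a token boundary, google.com inside api.google.com is never picked up: the char before it is a dot, not whitespace.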


Python demo:

import re
test_str = [
    "bla bla google.com bla bla", #0
    "bla bla www.google.com bla bla", #1
    "bla bla api.google.com bla bla", #2
    "google.com", #3
    "www.google.com", #4
    "api.google.com", #5
    "http://google.com", #6
    "http://www.google.com", #7
    "http://api.google.com", #8
    "bla bla http://www.google.com bla bla", #9
    "bla bla https://www.api.google.com bla bla", #10
    "bla bla https://www.map.google.com bla bla" #11
]
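r = re.compile(r"(?<!\S)(?:https?://)?(?:www\.)?google\.\S*")
for i, s in enumerate(test_str):
    m = r.search(s)
    if m:
        print(m.group(), f"#{i}")
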
Output:

google.com  #0
www.google.com  #1
google.com  #3
www.google.com  #4
http://google.com   #6
http://www.google.com   #7
http://www.google.com   #9
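
If you prefer the more explicit (?:^|\s) boundary mentioned in the details, note that it consumes the leading whitespace char, so the URL has to be read from a capturing group. A sketch under that assumption (the names ALT_PATTERN and find_google_urls are hypothetical), which also returns every match in a string rather than only the first:

```python
import re

# (?:^|\s) consumes the boundary char, so the URL itself goes into group 1.
ALT_PATTERN = re.compile(r"(?:^|\s)((?:https?://)?(?:www\.)?google\.\S*)")


def find_google_urls(text):
    """Return all google.* / www.google.* tokens found in text."""
    # With exactly one capturing group, findall returns that group's text.
    return ALT_PATTERN.findall(text)


print(find_google_urls("google.com and api.google.com and http://www.google.com"))
# ['google.com', 'http://www.google.com']
```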
